Maksim Vazhenin [Dell Technologies] | InfluxDB for Storage System Monitoring | InfluxDays EMEA 2021

Maksim Vazhenin
Software Sr Principal Engineer
Dell Technologies
InfluxDB for Storage System Monitoring
Internal Use - Confidential
Agenda
• Our journey to the InfluxDB monitoring stack
• High Availability for InfluxDB
• Horizontally scalable query with Flux language
• Deploy on low memory resources
• How to switch monitoring stack
• Dashboards…

ECS
[Diagram: ECS across Site 1, Site 2, and Site 3. Use cases shown: IoT, Financial Services, Media & Entertainment, Cloud, Backup, Archive, Modern Apps, Evidence Repository, Analytics]

ECS is deployed on physical nodes grouped into racks
[Diagram: a datacenter with Rack 1 to Rack N; nodes Node 1 to Node 4 and Node N to Node N+3]

ECS is deployed in Docker containers

ECS internal monitoring data
• Performance
• System monitoring
• Internal health metrics
• Capacity (lots of complicated compute from many services)

Existing monitoring solution disadvantages
• Different teams must be involved to show data in the UI
• Code change required to add a new dashboard
• No flexible query language
• Slow on large queries

Need for modern monitoring stack
• Easy to build dashboards
• System resources monitoring
• Easy to create alerts
• Autonomous service teams

Challenges
• High scale (clusters of ~300 nodes)
• No free resources

Alternatives
• ELK
• Prometheus
• InfluxDB

Alternatives: ELK
• High resource requirements
• Flexible analytics

Alternatives: Prometheus
• High cardinality
• Bad when working with rare data
• Bad at counting exact values
• Performance
• Query language
• Does not support backfilling

Alternatives: Prometheus
[Chart: counter samples 12, 15, 17 at a polling interval of 2m; increase(4m), with two regions labeled "extrapolation"]

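The extrapolation point can be illustrated with a small sketch. The function below is a simplified model of how a Prometheus-style increase() scales the raw counter delta to cover the whole query window; it is an assumption-laden approximation, not Prometheus code. It ignores counter resets and the cap that stops extrapolation below zero, and the timestamps are hypothetical values chosen to match a 2m polling interval and a 4m window.

```python
def extrapolated_increase(samples, window_start, window_end):
    """Simplified model of a Prometheus-style increase() over one window.

    samples: (timestamp_seconds, counter_value) pairs inside the window,
    sorted by time. Counter resets and the zero-value cap are ignored.
    """
    if len(samples) < 2:
        return None  # not enough points to compute an increase

    t_first, v_first = samples[0]
    t_last, v_last = samples[-1]
    raw_delta = v_last - v_first
    sampled = t_last - t_first
    avg_interval = sampled / (len(samples) - 1)

    # Gaps between the window edges and the nearest samples.
    gap_start = t_first - window_start
    gap_end = window_end - t_last

    # If a gap is large, extrapolate only half an average interval.
    threshold = avg_interval * 1.1
    if gap_start >= threshold:
        gap_start = avg_interval / 2
    if gap_end >= threshold:
        gap_end = avg_interval / 2

    # Scale the raw delta to the extrapolated duration.
    return raw_delta * (sampled + gap_start + gap_end) / sampled


# Hypothetical timestamps: two samples, 2 minutes apart, in a 4-minute window.
samples = [(30, 12), (150, 15)]
result = extrapolated_increase(samples, window_start=0, window_end=240)
print(result)  # 6.0, although the counter only moved from 12 to 15
```

The raw delta between the two samples is 3, but the extrapolated result is 6.0, which is why this kind of query is a poor fit for counting exact values.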
Alternatives: InfluxDB
• High cardinality
• InfluxQL
• Performance
• Can be used for exact compute
• Supports backfilling
• Flux query language

Our journey to the InfluxDB monitoring stack
• Distributed storage monitoring
• InfluxDB beats competitors

Single InfluxDB
[Diagram: Nodes 1 to 5; one InfluxDB instance]

Telegraf on all nodes
[Diagram: Nodes 1 to 5, each with Telegraf; one InfluxDB instance]

Grafana on all nodes
[Diagram: Nodes 1 to 5, each with Grafana and Telegraf; one InfluxDB instance]

No data if node is down
[Diagram: Nodes 1 to 5, each with Grafana and Telegraf; one InfluxDB instance]

Run 3 InfluxDB
[Diagram: Nodes 1 to 5, each with Grafana and Telegraf; three InfluxDB instances]
InfluxDB datasources
• Influxdb1
• Influxdb2
• Influxdb3

Support 2 nodes down
[Diagram: Nodes 1 to 5, each with Grafana and Telegraf; three InfluxDB instances]

After failures some data may be unavailable
[Diagram: Nodes 1 to 5, each with Grafana and Telegraf; three InfluxDB instances]

Recover on startup
[Diagram: Nodes 1 to 5, each with Grafana and Telegraf; three InfluxDB instances]
• Use backup-restore API

All data available even in case of rolling failures
[Diagram: Nodes 1 to 5, each with Grafana and Telegraf; three InfluxDB instances]

Now we can even do node replacements
[Diagram: Nodes 1 to 5, each with Grafana and Telegraf; three InfluxDB instances]

High Availability for InfluxDB
• Run multiple InfluxDB instances
• Use backup-restore API

Select datasource manually in Grafana
[Diagram: Nodes 1 to 5, each with Grafana and Telegraf; three InfluxDB instances]
InfluxDB datasources
• Influxdb1
• Influxdb2
• Influxdb3

Run Fluxd service on all nodes
[Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances]
Fluxd datasource
• local fluxd

Fluxd: Offload compute from InfluxDB
[Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances; labels "complex query" and "from() |> filter() |> range()"]

Fluxd: Load-balance complex compute
[Diagram: same layout; label "complex query"]

Fluxd: Stateless
[Diagram: same layout, with Node 1 drawn apart from Nodes 2 to 5]

Horizontally scalable query with Flux language
• Single datasource
• Offload compute from InfluxDB
• Load-balance requests
• Horizontally scalable

Need to use minimal resources and avoid OOM
[Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances]

Telegraf
[Diagram: Telegraf on Node 3]

Services push metrics to Telegraf
[Diagram: Service1, Service2, … ServiceN and Telegraf on a node]

Sometimes services may push more metrics
[Diagram: Service1, Service2, … ServiceN and Telegraf on a node]

More metrics cause OOM
[Diagram: Service1, Service2, … ServiceN and Telegraf on a node]

Better to drop metrics than die
[Diagram: Service1, Service2, … ServiceN and Telegraf on a node]
• Drop metrics when buffer is filled

Set buffer limit for Telegraf
metric_batch_size = 1000
metric_buffer_limit = 4000

Telegraf has predictable memory for received metrics
metric_batch_size = 1000
metric_buffer_limit = 4000

Telegraf has predictable memory when some InfluxDB instances are down
[Diagram: Telegraf on a node with three InfluxDB outputs]
• Buffer per output

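The buffering behavior described in these slides can be modeled with a short sketch. It is an illustration only, not Telegraf's implementation: the OutputBuffer class and the output names are hypothetical, and the only values taken from the slides are metric_batch_size = 1000 and metric_buffer_limit = 4000. Each output gets its own bounded buffer, and when a buffer is full the oldest metric is dropped to make room.

```python
from collections import deque

METRIC_BATCH_SIZE = 1000    # from the slides
METRIC_BUFFER_LIMIT = 4000  # from the slides


class OutputBuffer:
    """Hypothetical bounded per-output buffer; drops the oldest metric when full."""

    def __init__(self, limit):
        self.metrics = deque(maxlen=limit)
        self.dropped = 0

    def add(self, metric):
        if len(self.metrics) == self.metrics.maxlen:
            self.dropped += 1  # deque evicts the oldest entry on append
        self.metrics.append(metric)

    def take_batch(self, batch_size):
        """Remove and return up to batch_size of the oldest buffered metrics."""
        n = min(batch_size, len(self.metrics))
        return [self.metrics.popleft() for _ in range(n)]


# One buffer per output (output names are assumptions for illustration).
outputs = ("influxdb1", "influxdb2", "influxdb3")
buffers = {name: OutputBuffer(METRIC_BUFFER_LIMIT) for name in outputs}

# Simulate 5000 metrics arriving while no output is reachable.
for i in range(5000):
    for buf in buffers.values():
        buf.add(i)

# Memory stays bounded: at most limit metrics per output.
worst_case_buffered = len(outputs) * METRIC_BUFFER_LIMIT
print(worst_case_buffered)           # 12000
print(buffers["influxdb1"].dropped)  # 1000

# When an output comes back, metrics are flushed in batches, oldest first.
batch = buffers["influxdb1"].take_batch(METRIC_BATCH_SIZE)
print(len(batch), batch[0])          # 1000 1000
```

Because the bound is per output, the worst case grows with the number of outputs, which is worth accounting for when several InfluxDB instances can be down at once.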
But still sometimes dies due to OOM
[Diagram: Service1, Service2, … ServiceN and Telegraf on a node; metric_batch_size = 1000, metric_buffer_limit = 4000]

Telegraf used lots of input plugins
Inputs
• InfluxDB listener
• procstat
• mem
• …
• exec

Exec plugin uses unpredictable scripts
[Diagram: Telegraf on a node; exec input running scripts]

Unpredictable scripts cause OOM
[Diagram: Telegraf on a node; exec input running scripts]

Stop using the exec plugin
[Diagram: Telegraf on a node; inputs InfluxDB listener, procstat, mem, …, exec]

Telegraf never dies
[Diagram: Service1, Service2, … ServiceN send new metrics to Telegraf on a node; inputs InfluxDB listener, procstat, mem, …]
• Drop metrics when buffer is filled
• metric_batch_size = 1000
• metric_buffer_limit = 4000

InfluxDB
[Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances]

InfluxDB memory driving factors
• Number of metrics
• Metrics cardinality
• Retention period
• Compute

Number of metrics matters
[Diagram: Telegraf on Nodes 1 to 5; InfluxDB]

Drop unused metrics, prevent high cardinality
[Diagram: Telegraf on Nodes 1 to 5; InfluxDB]
• Filter out unused metrics
• namepass

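As a rough illustration of what a namepass-style filter does, the sketch below keeps only metrics whose names match one of a list of glob patterns. It approximates the matching with Python's fnmatchcase, which may differ from Telegraf's glob syntax in edge cases; the metric names and patterns are hypothetical.

```python
from fnmatch import fnmatchcase


def namepass(metric_names, patterns):
    """Keep only metric names that match at least one glob pattern."""
    return [
        name
        for name in metric_names
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical metric names emitted by services on a node.
names = ["cpu", "mem", "disk_io", "service_debug_trace", "service_latency"]

# Hypothetical allow-list: everything else is dropped before it reaches InfluxDB.
kept = namepass(names, ["cpu", "mem", "service_latency*"])
print(kept)  # ['cpu', 'mem', 'service_latency']
```

An allow-list of this kind means a service that starts emitting new, unexpected metrics does not silently increase the number of series stored.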
Push less frequently if you can
[Diagram: Telegraf on Nodes 1 to 5; InfluxDB]
• Push interval: 5 min

With full history InfluxDB used more memory
[Diagram: Telegraf on Nodes 1 to 5; InfluxDB]

Select shard duration carefully
[Diagram: a database with a retention period split into shards, each shard with its own index]
• Shard count < 10

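The "shard count < 10" guideline comes down to simple arithmetic: the number of shards is roughly the retention period divided by the shard duration. The sketch below works this out for a hypothetical 90-day retention period and a few candidate shard durations; none of these values come from the talk, and the real count can be slightly higher while an expiring shard is still on disk.

```python
import math
from datetime import timedelta


def shard_count(retention, shard_duration):
    """Approximate number of shards needed to cover the retention period."""
    return math.ceil(retention / shard_duration)


def smallest_duration_under(retention, candidates, max_shards=10):
    """Return the shortest candidate duration that keeps the shard count below max_shards."""
    for duration in sorted(candidates):
        if shard_count(retention, duration) < max_shards:
            return duration
    return None  # no candidate satisfies the limit


# Hypothetical values for illustration.
retention = timedelta(days=90)
candidates = [timedelta(days=d) for d in (1, 7, 14, 30)]

print(shard_count(retention, timedelta(days=7)))   # 13
print(shard_count(retention, timedelta(days=14)))  # 7

chosen = smallest_duration_under(retention, candidates)
print(chosen.days)  # 14
```

With these assumed numbers, weekly shards would exceed the limit, while 14-day shards stay under it.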
Resource consumption of all components is under control
[Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances]

ECS is operated by customer
[Diagram: a datacenter with Rack 1 to Rack N; nodes Node 1 to Node 4 and Node N to Node N+3]

Customer sometimes uses external monitoring
[Diagram: a datacenter with Rack 1, Nodes 1 to 4, and External Monitoring]

Periodically poll Fluxd for external monitoring
[Diagram: Nodes 1 to 5, each with Grafana, Fluxd, and Telegraf; three InfluxDB instances; External Monitoring]

Extra resources needed on Fluxd and InfluxDB
[Diagram: same layout]

Not all metrics are available in internal monitoring
[Diagram: same layout; label "filter some metrics"]

Push all metrics from Telegraf to external
[Diagram: same layout; labels "Send all metrics" and "filter some metrics"]

Extra continuous queries and dashboards on external
[Diagram: same layout; labels "Send all metrics" and "filter some metrics"]

Deploy on low memory resources
• Limit Telegraf buffer
• Do not use exec input plugins
• Offload compute from InfluxDB
• Filter out non-needed metrics
• Push with lower frequency
• Push metrics to external

How to switch monitoring stack
[Diagram components: UI, Alerting framework, Node, Telegraf, Services, Grafana, InfluxDB, Fluxd, Dashboard service, Statistic framework]

Lots of new dashboards were created
• Performance
• System metrics
• Top N buckets
• And many more …

Summary
• InfluxDB is a great time series database
• High availability can be added on top of the OSS version
• It can fit into low memory resources
• It can be used for internal monitoring in on-premises products
• Good luck using it in your product

Questions? Feedback?
Let’s connect!
Email: maksim.vazhenin@dell.com
LinkedIn: https://www.linkedin.com/in/maksim-vazhenin/
Piloting & Scaling Successfully With Microsoft VivaPiloting & Scaling Successfully With Microsoft Viva
Piloting & Scaling Successfully With Microsoft Viva
HTTP headers that make your website go faster - devs.gent November 2023 by Thijs Feryn
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023
Thijs Feryn22 views
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 by IttrainingIttraining
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson85 views

Maksim Vazhenin [Dell Technologies] | InfluxDB for Storage System Monitoring | InfluxDays EMEA 2021

  • 1. Maksim Vazhenin, Software Sr Principal Engineer, Dell Technologies. InfluxDB for Storage System Monitoring
  • 2. Agenda: Our journey to InfluxDB monitoring stack; High Availability for InfluxDB; Horizontally scalable query with Flux language; Deploy on low memory resources; How to switch monitoring stack; Dashboards…
  • 3. • Our journey to InfluxDB monitoring stack • High Availability for InfluxDB • Horizontally scalable query with Flux language • Deploy on low memory resources • How to switch monitoring stack • Dashboards…
  • 4. ECS (diagram labels: Site 1, Site 2, Site 3; IoT, Financial Services, Media & Entertainment, Cloud, Backup, Archive, Modern Apps, Evidence Repository, Analytics)
  • 5. ECS is deployed on physical nodes grouped into racks (diagram: Datacenter; Rack 1 with Nodes 1 to 4; Rack N with Nodes N to N+3)
  • 6. ECS is deployed in Docker containers
  • 7. ECS internal monitoring data: Performance; System monitoring; Internal health metrics; Capacity (lots of complicated compute from many services)
  • 8. Existing monitoring solution disadvantages: Different teams involved to show data on UI; Code change to add new dashboard; No flexible query language; Slow on large queries
  • 9. Need for modern monitoring stack: Easy to build dashboards; System resources monitoring; Easy to create alerts; Autonomous service teams
  • 10. Challenges: High scale (~300-node clusters); No free resources
  • 11. Alternatives: ELK, Prometheus, InfluxDB
  • 12. Alternatives: ELK. High resource requirements; Flexible analytics
  • 13. Alternatives: Prometheus. High cardinality; Bad when working with rare data; Bad at counting exact values; Performance; Query language; Does not support backfilling
  • 14. Alternatives: Prometheus (diagram: sample values 12, 15, 17; polling interval 2m; increase(4m) relies on extrapolation)
  • 15. Alternatives: InfluxDB. High cardinality; InfluxQL; Performance; Can be used for exact compute; Supports backfilling; Flux query language
  • 16. Alternatives: ELK, Prometheus, InfluxDB
  • 17. Our journey to InfluxDB monitoring stack: Distributed storage monitoring; InfluxDB beats competitors
  • 18. • Our journey to InfluxDB monitoring stack • High Availability for InfluxDB • Horizontally scalable query with Flux language • Deploy on low memory resources • How to switch monitoring stack • Dashboards…
  • 19. Single InfluxDB (diagram: Nodes 1 to 5, one InfluxDB instance)
  • 20. Telegraf on all nodes (diagram: Nodes 1 to 5, Telegraf on each node, one InfluxDB instance)
  • 21. Grafana on all nodes (diagram: Nodes 1 to 5, Grafana and Telegraf on each node, one InfluxDB instance)
  • 22. No data if node is down (diagram: Nodes 1 to 5, Grafana and Telegraf on each node, one InfluxDB instance)
  • 23. Run 3 InfluxDB instances (diagram: Nodes 1 to 5, Grafana and Telegraf on each node, three InfluxDB instances; InfluxDB datasources: Influxdb1, Influxdb2, Influxdb3)
  • 24. Support 2 nodes down (diagram: Nodes 1 to 5, Grafana and Telegraf on each node, three InfluxDB instances)
  • 25. After failures some data may be unavailable (diagram: Nodes 1 to 5, Grafana and Telegraf on each node, three InfluxDB instances)
  • 26. (no slide text)
  • 27. Recover on startup: use backup-restore API (diagram: Nodes 1 to 5, Grafana and Telegraf on each node, three InfluxDB instances)
  • 28. All data available even in case of rolling failures (diagram: Nodes 1 to 5, Grafana and Telegraf on each node, three InfluxDB instances)
  • 29. Now we can even do node replacements (diagram: Nodes 1 to 5, Grafana and Telegraf on each node, three InfluxDB instances)
  • 30. High Availability for InfluxDB: Run multiple InfluxDB instances; Use backup-restore API
  • 31. • Our journey to InfluxDB monitoring stack • High Availability for InfluxDB • Horizontally scalable query with Flux language • Deploy on low memory resources • How to switch monitoring stack • Dashboards…
  • 32. Select datasource manually in Grafana (diagram: Nodes 1 to 5, Grafana and Telegraf on each node, three InfluxDB instances; InfluxDB datasources: Influxdb1, Influxdb2, Influxdb3)
  • 33. Run Fluxd service on all nodes (diagram: Nodes 1 to 5; Grafana, Telegraf, and Fluxd on each node; three InfluxDB instances; Fluxd datasource: local fluxd)
  • 34. Fluxd: Offload compute from InfluxDB (diagram labels: from()|>filter()|>range(); complex query)
  • 35. Fluxd: Load-balance complex compute (diagram label: complex query)
  • 36. Fluxd: Stateless (diagram: Nodes 1 to 5; Grafana, Telegraf, and Fluxd on each node; three InfluxDB instances)
  • 37. Horizontally scalable query with Flux language: Single datasource; Offload compute from InfluxDB; Load-balance requests; Horizontally scalable
  • 38. • Our journey to InfluxDB monitoring stack • High Availability for InfluxDB • Horizontally scalable query with Flux language • Deploy on low memory resources • How to switch monitoring stack • Dashboards…
  • 39. Need to use minimal resources and avoid OOM (diagram: Nodes 1 to 5; Grafana, Telegraf, and Fluxd on each node; three InfluxDB instances)
  • 40. Telegraf (diagram: Telegraf on Node 3)
  • 41. Services push metrics to Telegraf (diagram: Service1, Service2, …, ServiceN and Telegraf on a node)
  • 42. Sometimes services may push more metrics (diagram: Service1, Service2, …, ServiceN and Telegraf on a node)
  • 43. More metrics cause OOM (diagram: Service1, Service2, …, ServiceN and Telegraf on a node)
  • 44. Better to drop metrics than die: drop metrics when buffer is filled
  • 45. Set buffer limit for Telegraf: metric_batch_size = 1000, metric_buffer_limit = 4000
  • 46. Telegraf has predictable memory for received metrics: metric_batch_size = 1000, metric_buffer_limit = 4000
  • 47. Telegraf has predictable memory when some InfluxDB instances are down: buffer per output (diagram: Telegraf on a node writing to three InfluxDB instances)
  • 48. But still sometimes dies due to OOM (metric_batch_size = 1000, metric_buffer_limit = 4000)
  • 49. Telegraf used lots of input plugins (inputs: InfluxDB listener, procstat, mem, …, exec)
  • 50. Exec plugin uses unpredictable scripts
  • 51. Unpredictable scripts cause OOM
  • 52. Get rid of the exec plugin (inputs: InfluxDB listener, procstat, mem, …; exec removed)
  • 53. Telegraf never dies: drop metrics when buffer is filled; metric_batch_size = 1000, metric_buffer_limit = 4000 (inputs: InfluxDB listener, procstat, mem, …)
  • 54. InfluxDB (diagram: Nodes 1 to 5; Grafana, Telegraf, and Fluxd on each node; three InfluxDB instances)
  • 55. InfluxDB memory driving factors: Number of metrics; Metrics cardinality; Retention period; Compute
  • 56. Number of metrics matters (diagram: Telegraf on Nodes 1 to 5 writing to InfluxDB)
  • 57. Drop unused metrics, prevent high cardinality: filter unused metrics with namepass
  • 58. Push less frequently if you can: push interval 5 min
  • 59. With full history InfluxDB used more memory (diagram: Telegraf on Nodes 1 to 5 writing to InfluxDB)
  • 60. Select shard duration carefully: shard count < 10 (diagram: database, retention, shards, one index per shard)
  • 61. All components' resource consumption is under control (diagram: Nodes 1 to 5; Grafana, Telegraf, and Fluxd on each node; three InfluxDB instances)
  • 62. ECS is operated by customer (diagram: Datacenter; Rack 1 with Nodes 1 to 4; Rack N with Nodes N to N+3)
  • 63. Customer sometimes uses external monitoring (diagram: Datacenter, Rack 1 with Nodes 1 to 4, External Monitoring)
  • 64. Periodically poll Fluxd for external monitoring
  • 65. Extra resources needed on Fluxd and InfluxDB
  • 66. Not all metrics are available in internal monitoring: filter some metrics
  • 67. Push all metrics from Telegraf instances to external monitoring (diagram labels: send all metrics; filter some metrics)
  • 68. Extra continuous queries and dashboards on external monitoring (diagram labels: send all metrics; filter some metrics)
  • 69. Deploy on low memory resources: Limit Telegraf buffer; Do not use exec input plugins; Offload compute from InfluxDB; Filter out unneeded metrics; Push with lower frequency; Push metrics to external monitoring
  • 70. • Our journey to InfluxDB monitoring stack • High Availability for InfluxDB • Horizontally scalable query with Flux language • Deploy on low memory resources • How to switch monitoring stack • Dashboards…
  • 71. How to switch monitoring stack (diagram labels: UI, Alerting framework, Node, Telegraf, Services, Grafana, InfluxDB, Fluxd, Dashboard service, Statistic framework)
  • 72. • Our journey to InfluxDB monitoring stack • High Availability for InfluxDB • Horizontally scalable query with Flux language • Deploy on low memory resources • How to switch monitoring stack • Dashboards…
  • 73. Lots of new dashboards were created
  • 74. Performance
  • 75. (no slide text)
  • 76. System metrics
  • 77. (no slide text)
  • 78. Top N buckets
  • 79. (no slide text)
  • 80. And many more …
  • 82. Summary: InfluxDB is a great time series database; High Availability can be added on top of the OSS version; It can fit into low memory resources; It can be used for internal monitoring in on-premises products; Good luck using it in your product
  • 83. Questions? Feedback? Let’s connect! Email: maksim.vazhenin@dell.com LinkedIn: https://www.linkedin.com/in/maksim-vazhenin/
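
Slide 14 points at extrapolation in the Prometheus increase() function as a reason it is weak at exact counts. The Python sketch below is a simplified model of that idea, not Prometheus's actual implementation: it scales the difference between the first and last samples inside the window up to the window length. The function name extrapolated_increase and the timestamps are assumptions made for illustration, and counter resets and the clamp at counter value zero are ignored.

```python
def extrapolated_increase(samples, window_start, window_end):
    """Estimate a counter's increase over [window_start, window_end].

    samples: list of (timestamp_seconds, counter_value), sorted by time,
    all inside the window. Simplified model; assumes no counter resets.
    """
    if len(samples) < 2:
        return None  # not enough data to estimate anything

    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    raw_increase = v_last - v_first
    sampled = t_last - t_first
    avg_interval = sampled / (len(samples) - 1)

    # Extend to the window boundary when the gap is close to one polling
    # interval; otherwise extend by only half an interval.
    threshold = avg_interval * 1.1
    gap_start = t_first - window_start
    gap_end = window_end - t_last
    extend_start = gap_start if gap_start < threshold else avg_interval / 2
    extend_end = gap_end if gap_end < threshold else avg_interval / 2

    return raw_increase * (sampled + extend_start + extend_end) / sampled


# Hypothetical timestamps: 2m polling, 4m window, samples 15 and 17 inside it.
estimate = extrapolated_increase([(60, 15), (180, 17)], 0, 240)
print(estimate)
```

In this example the two stored samples differ by 2, while the estimate over the 4m window is 4.0, so the reported value is an estimate and not an exact count.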
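
Slides 44 to 47 describe bounding Telegraf memory by dropping metrics once a per-output buffer is full. The Python sketch below models that behavior under stated assumptions; it is not Telegraf code. The class name OutputBuffer, the output names, and the choice to discard the oldest metric first are illustrative assumptions; only the values 1000 and 4000 come from the slides.

```python
from collections import deque


class OutputBuffer:
    """Bounded metric buffer for one output; drops the oldest entry when full."""

    def __init__(self, limit):
        self._queue = deque(maxlen=limit)
        self.dropped = 0

    def add(self, metric):
        # A full deque with maxlen silently discards from the left on append,
        # so count the drop before it happens.
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1
        self._queue.append(metric)

    def take_batch(self, batch_size):
        """Remove and return up to batch_size of the oldest buffered metrics."""
        count = min(batch_size, len(self._queue))
        return [self._queue.popleft() for _ in range(count)]

    def __len__(self):
        return len(self._queue)


# One buffer per output, so a down InfluxDB instance only fills its own buffer.
METRIC_BATCH_SIZE = 1000     # value from slide 45
METRIC_BUFFER_LIMIT = 4000   # value from slide 45
buffers = {name: OutputBuffer(METRIC_BUFFER_LIMIT)
           for name in ("influxdb1", "influxdb2", "influxdb3")}  # hypothetical names

for i in range(5000):
    for buf in buffers.values():
        buf.add(i)

batch = buffers["influxdb1"].take_batch(METRIC_BATCH_SIZE)
print(len(batch), len(buffers["influxdb1"]), buffers["influxdb1"].dropped)
```

Memory per output is capped at the buffer limit regardless of how many metrics the services push or how long an output stays unreachable, which is the property the slides rely on.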
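
Slide 57 recommends dropping unused metrics with namepass before they reach InfluxDB. The Python sketch below illustrates allow-list filtering by measurement name with glob patterns. It is an approximation, not Telegraf's matcher: fnmatchcase stands in for Telegraf's glob handling, and the measurement names and patterns are hypothetical.

```python
from fnmatch import fnmatchcase


def namepass(metrics, patterns):
    """Keep only metrics whose measurement name matches at least one pattern."""
    return [
        metric for metric in metrics
        if any(fnmatchcase(metric["name"], pattern) for pattern in patterns)
    ]


# Hypothetical measurements emitted on a node.
incoming = [
    {"name": "cpu", "fields": {"usage_idle": 91.2}},
    {"name": "mem", "fields": {"used_percent": 40.5}},
    {"name": "debug_trace_events", "fields": {"count": 12}},
    {"name": "disk_io", "fields": {"reads": 1024}},
]

# Only measurements that dashboards actually use are allowed through.
kept = namepass(incoming, ["cpu", "mem", "disk_*"])
print([metric["name"] for metric in kept])
```

Filtering on the collector side means unused series never create index entries in InfluxDB, which is how it helps with both metric count and cardinality.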
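
Slide 60 advises choosing the shard duration so that the shard count stays below 10. The arithmetic can be sketched in Python as the retention period divided by the shard duration, rounded up. The 30-day retention and the candidate durations are assumptions for illustration; the talk does not state the values used in ECS.

```python
import math
from datetime import timedelta


def shard_count(retention, shard_duration):
    """Number of shard durations needed to cover the retention period."""
    return math.ceil(retention / shard_duration)


def pick_shard_duration(retention, candidates, max_shards=10):
    """Return the shortest candidate duration that keeps the count below max_shards."""
    for duration in sorted(candidates):
        if shard_count(retention, duration) < max_shards:
            return duration
    return None  # no candidate is long enough


retention = timedelta(days=30)  # assumed retention period
candidates = [timedelta(hours=1), timedelta(days=1), timedelta(days=7)]

for duration in candidates:
    print(duration, shard_count(retention, duration))

print(pick_shard_duration(retention, candidates))
```

With these assumed numbers, 1-hour shards give 720 shards and 1-day shards give 30, while 7-day shards give 5, so the 7-day duration is the shortest candidate that meets the limit from the slide.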