SlideShare a Scribd company logo
How to reduce expenses on monitoring
with VictoriaMetrics
Roman Khavronenko | github.com/hagen1778
Roman Khavronenko
Co-founder of VictoriaMetrics
Software engineer with experience in distributed systems,
monitoring and high-performance services.
https://github.com/hagen1778
https://twitter.com/hagen1778
What this talk is about
1. Best ways for storing and processing metrics
2. Open source tools only
3. For people familiar with Prometheus,
Thanos, Mimir, VictoriaMetrics
Expenses!
You can either have a faster car…
…or be a smarter driver!
What can you get from simple replacing?
Prometheus remote-write benchmark
Prometheus vs VictoriaMetrics benchmark
# the number of nodeexporter instances to scrape
targetsCount: 1000
# how frequently to scrape nodeexporter targets
scrapeInterval: 15s
# rules evaluation interval
# https://awesome-prometheus-alerts.grep.to/rules.html#host-and-hardware-1
queryInterval: 30s
# scrapeConfigUpdatePercent is a churn rate generated once
# per scrapeConfigUpdateInterval
scrapeConfigUpdatePercent: 5
scrapeConfigUpdateInterval: 10m
Prometheus vs VictoriaMetrics benchmark
x16 times faster!
x1.9 times faster!
x1.7 less memory!
x2.5 times less!
Summary after 7d benchmark (1k nodeexporter targets)
Prometheus:
CPU avg used: 0.79 / 3 cores
Disk occupied: 83.5 GiB
Mem max used: 8.12 GiB / 12 GiB
Read latency avg:
50th - 70.5ms
99th - 7s
VictoriaMetrics:
CPU avg used: 0.76 / 3 cores
Disk occupied: 33 GiB
Mem max used: 4.5 GiB / 12 GiB
Read latency avg:
50th - 4.3ms
99th - 3.6s
Data transfer costs
Network Data transfer costs
x4.5 times less!
Improving network compression
1. Increase compression level, trade CPU for network savings:
a. -remoteWrite.vmProtoCompressLevel
2. Increase batch size, trade latency for compression:
a. -remoteWrite.maxBlockSize
b. -remoteWrite.maxRowsPerBlock
c. -remoteWrite.flushInterval
3. Reduce entropy to improve compression:
a. -remoteWrite.significantFigures
b. -remoteWrite.roundDigits
How to be smarter about data
Keeping only significant figures
instance:cpu_utilization:ratio_avg{instance="foo"} 0.05055757575781
instance:cpu_utilization:ratio_avg{instance="bar"} 0.05058181818236
rules:
- record: instance:cpu_utilization:ratio_avg
expr: avg_over_time(instance:node_cpu_utilization:ratio[5m])
Keeping only significant figures
Applying --vm-significant-figures=8 to recording rules
0.05055757575781
0.050557576
changed compression ratio from 1.2B to 0.8B per sample
See more at https://medium.com/victoriametrics-how-to-migrate-data-from-prometheus
Understanding the data - query tracing
VictoriaMetrics supports query tracing for detecting bottlenecks during query processing.
This is like EXPLAIN ANALYZE from Postgresql!
https://play.victoriametrics.com
Query tracing demo!
If query tracing demo didn't work…
Typical query takes 4s to execute… Why?
If query tracing demo didn't work…
Let's check the trace!
If query tracing demo didn't work…
91% of the time was spent on vmselect while aggregating
9.4k series, 13Mil data samples!
How to improve query speed?
1. Add more resources to monitoring.
2. Or… be smarter about data!
Cardinality explorer demo!
https://play.victoriametrics.com
If cardinality explorer demo didn't work…
If cardinality explorer demo didn't work…
If cardinality explorer demo didn't work…
Cardinality explorer: summary
VictoriaMetrics allows exploring time series cardinality to identify:
● Metric names with the highest number of series
● Labels with the highest number of series
● Values with the highest number of series for the selected label
● label=name pairs with the highest number of series
● Labels with the highest number of unique values
➔ Available built-in in VictoriaMetrics components
➔ Supports specifying Prometheus URL
Streaming aggregation vs Recording rules
The number of time series stored in TSDB
is Data-in + Recording Rules results
Streaming aggregation vs Recording rules
The number of time series stored in TSDB
is only what needs to be persisted
How to use streaming aggregation
- match: "grpc_server_handled_total" # time series selector
interval: "2m" # on 2m interval
outputs: ["total"] # aggregate as counter
without: ["grpc_method"] # group without label
Result:
grpc_server_handled_total:2m_without_grpc_method_total
How to use streaming aggregation
https://play.victoriametrics.com
Streaming aggregation: summary
1. Aggregate incoming samples in streaming mode before data is written to remote
storage
2. Aggregation is applied to all the metrics received via any supported data
ingestion protocol and/or scraped from Prometheus-compatible targets
3. Statsd alternative
4. Recording rules alternative
5. Reducing the number of stored samples
6. Reducing the number of stored series
7. Compatible with tools supporting Prometheus remote write protocol
Complexity penalty
Cortex architecture
Mimir architecture
VictoriaMetrics architecture
Complexity penalty
● Complex systems are harder to maintain
● Complex systems are harder to educate about
● Complex systems are more expensive to scale
Additional materials
1. Snapshot of Grafana dashboard from the benchmark
2. Benchmark repo for reproducing the test
3. Save network costs with VictoriaMetrics remote write protocol
4. VictoriaMetrics: achieving better compression than Gorilla for time series data
5. Streaming aggregation
6. VictoriaMetrics playground
Questions?
● https://github.com/VictoriaMetrics
● https://github.com/hagen1778

More Related Content

What's hot

Container Network Interface: Network Plugins for Kubernetes and beyond
Container Network Interface: Network Plugins for Kubernetes and beyondContainer Network Interface: Network Plugins for Kubernetes and beyond
Container Network Interface: Network Plugins for Kubernetes and beyond
KubeAcademy
 

What's hot (20)

TRex Realistic Traffic Generator - Stateless support
TRex  Realistic Traffic Generator  - Stateless support TRex  Realistic Traffic Generator  - Stateless support
TRex Realistic Traffic Generator - Stateless support
 
So You Want to Write a Connector?
So You Want to Write a Connector? So You Want to Write a Connector?
So You Want to Write a Connector?
 
Container Network Interface: Network Plugins for Kubernetes and beyond
Container Network Interface: Network Plugins for Kubernetes and beyondContainer Network Interface: Network Plugins for Kubernetes and beyond
Container Network Interface: Network Plugins for Kubernetes and beyond
 
Kubernetes PPT.pptx
Kubernetes PPT.pptxKubernetes PPT.pptx
Kubernetes PPT.pptx
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Cruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersCruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clusters
 
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
 
Introduction to the Container Network Interface (CNI)
Introduction to the Container Network Interface (CNI)Introduction to the Container Network Interface (CNI)
Introduction to the Container Network Interface (CNI)
 
DevOps Days Kyiv 2019 -- Victoria Metrics // Artem Navoiev
DevOps Days Kyiv 2019 -- Victoria Metrics // Artem NavoievDevOps Days Kyiv 2019 -- Victoria Metrics // Artem Navoiev
DevOps Days Kyiv 2019 -- Victoria Metrics // Artem Navoiev
 
MariaDB MaxScale
MariaDB MaxScaleMariaDB MaxScale
MariaDB MaxScale
 
The Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache FlinkThe Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache Flink
 
Get Hands-On with NGINX and QUIC+HTTP/3
Get Hands-On with NGINX and QUIC+HTTP/3Get Hands-On with NGINX and QUIC+HTTP/3
Get Hands-On with NGINX and QUIC+HTTP/3
 
Autoscaling Kubernetes
Autoscaling KubernetesAutoscaling Kubernetes
Autoscaling Kubernetes
 
Linking Metrics to Logs using Loki
Linking Metrics to Logs using LokiLinking Metrics to Logs using Loki
Linking Metrics to Logs using Loki
 
Fluentd Overview, Now and Then
Fluentd Overview, Now and ThenFluentd Overview, Now and Then
Fluentd Overview, Now and Then
 
Nginx Essential
Nginx EssentialNginx Essential
Nginx Essential
 
Introduce Google Kubernetes
Introduce Google KubernetesIntroduce Google Kubernetes
Introduce Google Kubernetes
 
Learn nginx in 90mins
Learn nginx in 90minsLearn nginx in 90mins
Learn nginx in 90mins
 
Scale Kubernetes to support 50000 services
Scale Kubernetes to support 50000 servicesScale Kubernetes to support 50000 services
Scale Kubernetes to support 50000 services
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
 

Similar to How to reduce expenses on monitoring

stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...
stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...
stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...
NETWAYS
 
DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...
DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...
DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...
DiscoveredByte
 
Overcoming (organizational) scalability issues in your Prometheus ecosystem
Overcoming (organizational) scalability issues in your Prometheus ecosystemOvercoming (organizational) scalability issues in your Prometheus ecosystem
Overcoming (organizational) scalability issues in your Prometheus ecosystem
QAware GmbH
 
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemTimely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Accumulo Summit
 

Similar to How to reduce expenses on monitoring (20)

stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...
stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...
stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by...
 
DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...
DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...
DiscoveredByte - Java Performance Monitoring, Tuning and Optimization - Key P...
 
observability pre-release: using prometheus to test and fix new software
observability pre-release: using prometheus to test and fix new softwareobservability pre-release: using prometheus to test and fix new software
observability pre-release: using prometheus to test and fix new software
 
Kafka monitoring and metrics
Kafka monitoring and metricsKafka monitoring and metrics
Kafka monitoring and metrics
 
Prometheus Everything, Observing Kubernetes in the Cloud
Prometheus Everything, Observing Kubernetes in the CloudPrometheus Everything, Observing Kubernetes in the Cloud
Prometheus Everything, Observing Kubernetes in the Cloud
 
Performance eng prakash.sahu
Performance eng prakash.sahuPerformance eng prakash.sahu
Performance eng prakash.sahu
 
Prelim Slides
Prelim SlidesPrelim Slides
Prelim Slides
 
How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?
 
Overcoming (organizational) scalability issues in your Prometheus ecosystem
Overcoming (organizational) scalability issues in your Prometheus ecosystemOvercoming (organizational) scalability issues in your Prometheus ecosystem
Overcoming (organizational) scalability issues in your Prometheus ecosystem
 
Monitor your Java application with Prometheus Stack
Monitor your Java application with Prometheus StackMonitor your Java application with Prometheus Stack
Monitor your Java application with Prometheus Stack
 
Query Optimization with MySQL 8.0 and MariaDB 10.3: The Basics
Query Optimization with MySQL 8.0 and MariaDB 10.3: The BasicsQuery Optimization with MySQL 8.0 and MariaDB 10.3: The Basics
Query Optimization with MySQL 8.0 and MariaDB 10.3: The Basics
 
Mathworks CAE simulation suite – case in point from automotive and aerospace.
Mathworks CAE simulation suite – case in point from automotive and aerospace.Mathworks CAE simulation suite – case in point from automotive and aerospace.
Mathworks CAE simulation suite – case in point from automotive and aerospace.
 
Overcoming scalability issues in your prometheus ecosystem
Overcoming scalability issues in your prometheus ecosystemOvercoming scalability issues in your prometheus ecosystem
Overcoming scalability issues in your prometheus ecosystem
 
DevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on KubernetesDevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on Kubernetes
 
So You Want to Write an Exporter
So You Want to Write an ExporterSo You Want to Write an Exporter
So You Want to Write an Exporter
 
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemTimely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
[OpenInfra Days Korea 2018] Day 2 - E6 - OpenInfra monitoring with Prometheus
 
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...A Framework for Scene Recognition Using Convolutional Neural Network as Featu...
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...
 
Three Perspectives on Measuring Latency
Three Perspectives on Measuring LatencyThree Perspectives on Measuring Latency
Three Perspectives on Measuring Latency
 

Recently uploaded

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 

Recently uploaded (20)

"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 

How to reduce expenses on monitoring

  • 1. How to reduce expenses on monitoring with VictoriaMetrics Roman Khavronenko | github.com/hagen1778
  • 2. Roman Khavronenko Co-founder of VictoriaMetrics Software engineer with experience in distributed systems, monitoring and high-performance services. https://github.com/hagen1778 https://twitter.com/hagen1778
  • 3. What this talk is about 1. Best ways for storing and processing metrics 2. Open source tools only 3. For people familiar with Prometheus, Thanos, Mimir, VictoriaMetrics
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 10. You can either have a faster car… …or be a smarter driver!
  • 11. What can you get from simple replacing?
  • 12.
  • 15. # the number of nodeexporter instances to scrape targetsCount: 1000 # how frequently to scrape nodeexporter targets scrapeInterval: 15s # rules evaluation interval # https://awesome-prometheus-alerts.grep.to/rules.html#host-and-hardware-1 queryInterval: 30s # scrapeConfigUpdatePercent is a churn rate generated once # per scrapeConfigUpdateInterval scrapeConfigUpdatePercent: 5 scrapeConfigUpdateInterval: 10m Prometheus vs VictoriaMetrics benchmark
  • 16.
  • 17.
  • 18.
  • 23.
  • 24. Summary after 7d benchmark (1k nodeexporter targets) Prometheus: CPU avg used: 0.79 / 3 cores Disk occupied: 83.5 GiB Mem max used: 8.12 GiB / 12 GiB Read latency avg: 50th - 70.5ms 99th - 7s VictoriaMetrics: CPU avg used: 0.76 / 3 cores Disk occupied: 33 GiB Mem max used: 4.5 GiB / 12 GiB Read latency avg: 50th - 4.3ms 99th - 3.6s
  • 28. Improving network compression 1. Increase compression level, trade CPU for network savings: a. -remoteWrite.vmProtoCompressLevel 2. Increase batch size, trade latency for compression: a. -remoteWrite.maxBlockSize b. -remoteWrite.maxRowsPerBlock c. -remoteWrite.flushInterval 3. Reduce entropy to improve compression: a. -remoteWrite.significantFigures b. -remoteWrite.roundDigits
  • 29. How to be smarter about data
  • 30. Keeping only significant figures instance:cpu_utilization:ratio_avg{instance="foo"} 0.05055757575781 instance:cpu_utilization:ratio_avg{instance="bar"} 0.05058181818236 rules: - record: instance:cpu_utilization:ratio_avg expr: avg_over_time(instance:node_cpu_utilization:ratio[5m])
  • 31. Keeping only significant figures Applying --vm-significant-figures=8 to recording rules 0.05055757575781 0.050557576 changed compression ratio from 1.2B to 0.8B per sample See more at https://medium.com/victoriametrics-how-to-migrate-data-from-prometheus
  • 32. Understanding the data - query tracing VictoriaMetrics supports query tracing for detecting bottlenecks during query processing. This is like EXPLAIN ANALYZE from Postgresql!
  • 34. If query tracing demo didn't work… Typical query takes 4s to execute… Why?
  • 35. If query tracing demo didn't work… Let's check the trace!
  • 36. If query tracing demo didn't work… 91% of the time was spent on vmselect while aggregating 9.4k series, 13Mil data samples!
  • 37. How to improve query speed? 1. Add more resources to monitoring. 2. Or… be smarter about data!
  • 39. If cardinality explorer demo didn't work…
  • 40. If cardinality explorer demo didn't work…
  • 41. If cardinality explorer demo didn't work…
  • 42. Cardinality explorer: summary VictoriaMetrics allows exploring time series cardinality to identify: ● Metric names with the highest number of series ● Labels with the highest number of series ● Values with the highest number of series for the selected label ● label=name pairs with the highest number of series ● Labels with the highest number of unique values ➔ Available built-in in VictoriaMetrics components ➔ Supports specifying Prometheus URL
  • 43. Streaming aggregation vs Recording rules The number of time series stored in TSDB is Data-in + Recording Rules results
  • 44. Streaming aggregation vs Recording rules The number of time series stored in TSDB is only what needs to be persisted
  • 45. How to use streaming aggregation - match: "grpc_server_handled_total" # time series selector interval: "2m" # on 2m interval outputs: ["total"] # aggregate as counter without: ["grpc_method"] # group without label Result: grpc_server_handled_total:2m_without_grpc_method_total
  • 46. How to use streaming aggregation https://play.victoriametrics.com
  • 47. Streaming aggregation: summary 1. Aggregate incoming samples in streaming mode before data is written to remote storage 2. Aggregation is applied to all the metrics received via any supported data ingestion protocol and/or scraped from Prometheus-compatible targets 3. Statsd alternative 4. Recording rules alternative 5. Reducing the number of stored samples 6. Reducing the number of stored series 7. Compatible with tools supporting Prometheus remote write protocol
  • 52. Complexity penalty ● Complex systems are harder to maintain ● Complex systems are harder to educate about ● Complex systems are more expensive to scale
  • 53. Additional materials 1. Snapshot of Grafana dashboard from the benchmark 2. Benchmark repo for reproducing the test 3. Save network costs with VictoriaMetrics remote write protocol 4. VictoriaMetrics: achieving better compression than Gorilla for time series data 5. Streaming aggregation 6. VictoriaMetrics playground