How to reduce expenses on monitoring

How to reduce expenses on monitoring
with VictoriaMetrics
Roman Khavronenko | github.com/hagen1778
Roman Khavronenko
Co-founder of VictoriaMetrics
Software engineer with experience in distributed systems,
monitoring and high-performance services.
https://github.com/hagen1778
https://twitter.com/hagen1778
What this talk is about
1. Best ways to store and process metrics
2. Open source tools only
3. For people familiar with Prometheus, Thanos, Mimir, VictoriaMetrics
Expenses!
You can either have a faster car…
…or be a smarter driver!
What can you get from a simple replacement?
Prometheus remote-write benchmark
Prometheus vs VictoriaMetrics benchmark
# the number of nodeexporter instances to scrape
targetsCount: 1000
# how frequently to scrape nodeexporter targets
scrapeInterval: 15s
# rules evaluation interval
# https://awesome-prometheus-alerts.grep.to/rules.html#host-and-hardware-1
queryInterval: 30s
# scrapeConfigUpdatePercent is a churn rate generated once
# per scrapeConfigUpdateInterval
scrapeConfigUpdatePercent: 5
scrapeConfigUpdateInterval: 10m
Prometheus vs VictoriaMetrics benchmark
16x faster!
1.9x faster!
1.7x less memory!
2.5x less!
Summary after 7d benchmark (1k nodeexporter targets)

                            Prometheus            VictoriaMetrics
CPU avg used                0.79 / 3 cores        0.76 / 3 cores
Disk occupied               83.5 GiB              33 GiB
Mem max used                8.12 GiB / 12 GiB     4.5 GiB / 12 GiB
Read latency avg, 50th      70.5 ms               4.3 ms
Read latency avg, 99th      7 s                   3.6 s
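These numbers also line up with the callouts from the charts above (a rough cross-check, assuming the callouts refer to the same 7d run):
Disk: 83.5 GiB / 33 GiB ≈ 2.5x less
Read latency, 50th: 70.5 ms / 4.3 ms ≈ 16x faster
Read latency, 99th: 7 s / 3.6 s ≈ 1.9x faster
Memory: 8.12 GiB / 4.5 GiB ≈ 1.8x less by peak usage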
Network data transfer costs
4.5x less!
Improving network compression
1. Increase compression level, trade CPU for network savings:
a. -remoteWrite.vmProtoCompressLevel
2. Increase batch size, trade latency for compression:
a. -remoteWrite.maxBlockSize
b. -remoteWrite.maxRowsPerBlock
c. -remoteWrite.flushInterval
3. Reduce entropy to improve compression:
a. -remoteWrite.significantFigures
b. -remoteWrite.roundDigits
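For example, a vmagent invocation combining a few of these flags could look like this (a sketch; the URL and flag values are illustrative, not recommendations):

# Trade CPU for network: higher compression level, bigger batches,
# longer flush interval, and rounded values before sending.
./vmagent \
  -remoteWrite.url=http://victoria-metrics:8428/api/v1/write \
  -remoteWrite.vmProtoCompressLevel=9 \
  -remoteWrite.maxRowsPerBlock=20000 \
  -remoteWrite.flushInterval=5s \
  -remoteWrite.significantFigures=8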
How to be smarter about data
Keeping only significant figures
instance:cpu_utilization:ratio_avg{instance="foo"} 0.05055757575781
instance:cpu_utilization:ratio_avg{instance="bar"} 0.05058181818236
rules:
  - record: instance:cpu_utilization:ratio_avg
    expr: avg_over_time(instance:node_cpu_utilization:ratio[5m])
Keeping only significant figures
Applying --vm-significant-figures=8 to recording rules:
0.05055757575781 → 0.050557576
improved compression from 1.2 bytes to 0.8 bytes per sample
See more at https://medium.com/victoriametrics-how-to-migrate-data-from-prometheus
Understanding the data - query tracing
VictoriaMetrics supports query tracing for detecting bottlenecks during query processing.
This is like EXPLAIN ANALYZE from PostgreSQL!
https://play.victoriametrics.com
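Outside of the UI, a trace can also be requested directly from the query APIs by adding the trace=1 query arg (a sketch; the host and query are placeholders):

curl 'http://<victoriametrics>:8428/api/v1/query?query=up&trace=1'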
Query tracing demo!
If query tracing demo didn't work…
Typical query takes 4s to execute… Why?
If query tracing demo didn't work…
Let's check the trace!
If query tracing demo didn't work…
91% of the time was spent on vmselect while aggregating 9.4k series and 13M data samples!
How to improve query speed?
1. Add more resources to monitoring.
2. Or… be smarter about data!
Cardinality explorer demo!
https://play.victoriametrics.com
If cardinality explorer demo didn't work…
Cardinality explorer: summary
VictoriaMetrics allows exploring time series cardinality to identify:
● Metric names with the highest number of series
● Labels with the highest number of series
● Values with the highest number of series for the selected label
● label=value pairs with the highest number of series
● Labels with the highest number of unique values
➔ Built into VictoriaMetrics components
➔ Can also point at a Prometheus URL
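The same kind of per-metric and per-label statistics can be fetched via the Prometheus-compatible /api/v1/status/tsdb endpoint, which Prometheus also exposes and which is presumably why the explorer can point at a Prometheus URL (a sketch; the host and topN value are illustrative):

curl 'http://<victoriametrics>:8428/api/v1/status/tsdb?topN=10'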
Streaming aggregation vs Recording rules
The number of time series stored in TSDB
is Data-in + Recording Rules results
Streaming aggregation vs Recording rules
The number of time series stored in TSDB
is only what needs to be persisted
How to use streaming aggregation
- match: "grpc_server_handled_total"   # time series selector
  interval: "2m"                       # on 2m interval
  outputs: ["total"]                   # aggregate as counter
  without: ["grpc_method"]             # group without label
Result:
grpc_server_handled_total:2m_without_grpc_method_total
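A sketch of wiring such a config in (assuming the rule above is saved as stream_aggr.yml; paths and URLs are placeholders):

# vmagent: aggregate samples before forwarding them to remote storage
./vmagent \
  -remoteWrite.url=http://victoria-metrics:8428/api/v1/write \
  -remoteWrite.streamAggr.config=stream_aggr.yml

# single-node VictoriaMetrics: aggregate on ingestion
./victoria-metrics -streamAggr.config=stream_aggr.yml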
How to use streaming aggregation
https://play.victoriametrics.com
Streaming aggregation: summary
1. Aggregate incoming samples in streaming mode before data is written to remote storage
2. Aggregation is applied to all the metrics received via any supported data ingestion protocol and/or scraped from Prometheus-compatible targets
3. StatsD alternative
4. Recording rules alternative
5. Reducing the number of stored samples
6. Reducing the number of stored series
7. Compatible with tools supporting Prometheus remote write protocol
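On item 7: any sender that speaks Prometheus remote write can push into an aggregating vmagent, e.g. a minimal prometheus.yml sketch (the vmagent address is a placeholder):

remote_write:
  - url: http://vmagent:8429/api/v1/write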
Complexity penalty
Cortex architecture
Mimir architecture
VictoriaMetrics architecture
Complexity penalty
● Complex systems are harder to maintain
● Complex systems are harder to educate about
● Complex systems are more expensive to scale
Additional materials
1. Snapshot of Grafana dashboard from the benchmark
2. Benchmark repo for reproducing the test
3. Save network costs with VictoriaMetrics remote write protocol
4. VictoriaMetrics: achieving better compression than Gorilla for time series data
5. Streaming aggregation
6. VictoriaMetrics playground
Questions?
● https://github.com/VictoriaMetrics
● https://github.com/hagen1778