How to reduce expenses on monitoring

How to reduce expenses on monitoring
with VictoriaMetrics
Roman Khavronenko | github.com/hagen1778

Roman Khavronenko
Co-founder of VictoriaMetrics
Software engineer with experience in distributed systems,
monitoring and high-performance services.
https://github.com/hagen1778
https://twitter.com/hagen1778

What this talk is about
1. Best ways to store and process metrics
2. Open source tools only
3. For people familiar with Prometheus,
Thanos, Mimir, VictoriaMetrics

You can either have a faster car…
…or be a smarter driver!

What can you gain from a simple replacement?

Prometheus remote-write benchmark

Prometheus vs VictoriaMetrics benchmark

# the number of nodeexporter instances to scrape
targetsCount: 1000
# how frequently to scrape nodeexporter targets
scrapeInterval: 15s
# rules evaluation interval
# https://awesome-prometheus-alerts.grep.to/rules.html#host-and-hardware-1
queryInterval: 30s
# scrapeConfigUpdatePercent is a churn rate generated once
# per scrapeConfigUpdateInterval
scrapeConfigUpdatePercent: 5
scrapeConfigUpdateInterval: 10m
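
To get a feel for the load this config generates, the arithmetic can be sketched as below. The variable names are illustrative, and the derived figures rest on two assumptions: every target is scraped exactly once per interval, and the churn percentage applies to the full target count.

```python
# Values copied from the benchmark config above.
targets_count = 1000
scrape_interval_s = 15
update_percent = 5
update_interval_s = 10 * 60

# Assumption: each target is scraped exactly once per scrape interval.
scrapes_per_second = targets_count / scrape_interval_s

# Assumption: the churn percentage applies to the full target count.
targets_changed_per_update = targets_count * update_percent // 100
updates_per_hour = 3600 // update_interval_s
targets_changed_per_hour = targets_changed_per_update * updates_per_hour

print(f"{scrapes_per_second:.1f} scrapes per second")
print(f"{targets_changed_per_update} targets changed per update")
print(f"{targets_changed_per_hour} targets changed per hour")
```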

Summary after 7d benchmark (1k nodeexporter targets)
Prometheus:
  CPU avg used: 0.79 / 3 cores
  Disk occupied: 83.5 GiB
  Mem max used: 8.12 GiB / 12 GiB
  Read latency avg:
    50th - 70.5ms
    99th - 7s
VictoriaMetrics:
  CPU avg used: 0.76 / 3 cores
  Disk occupied: 33 GiB
  Mem max used: 4.5 GiB / 12 GiB
  Read latency avg:
    50th - 4.3ms
    99th - 3.6s
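
The relative differences are easier to compare as percentages. The sketch below only re-computes the figures from the summary above; the dictionary keys and the helper name are illustrative.

```python
# Figures copied from the 7-day summary above (latencies converted to ms).
prometheus = {"cpu_cores": 0.79, "disk_gib": 83.5, "mem_gib": 8.12,
              "p50_ms": 70.5, "p99_ms": 7000}
victoriametrics = {"cpu_cores": 0.76, "disk_gib": 33, "mem_gib": 4.5,
                   "p50_ms": 4.3, "p99_ms": 3600}


def reduction_percent(before, after):
    """Return how much lower `after` is than `before`, in percent."""
    return (before - after) / before * 100


savings = {
    name: reduction_percent(prometheus[name], victoriametrics[name])
    for name in prometheus
}

for name, percent in savings.items():
    print(f"{name}: {percent:.1f}% lower")
```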

Improving network compression
1. Increase compression level, trade CPU for network savings:
a. -remoteWrite.vmProtoCompressLevel
2. Increase batch size, trade latency for compression:
a. -remoteWrite.maxBlockSize
b. -remoteWrite.maxRowsPerBlock
c. -remoteWrite.flushInterval
3. Reduce entropy to improve compression:
a. -remoteWrite.significantFigures
b. -remoteWrite.roundDigits
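
The first two trade-offs can be illustrated with a small experiment. This is only a sketch: zlib from the Python standard library stands in for whatever codec the remote write protocol actually uses, and the sample rows, metric name, and block sizes are made up for illustration.

```python
import random
import zlib

rng = random.Random(42)  # fixed seed so the run is repeatable


def make_rows(count):
    # Hypothetical samples in a text form, not a real wire format.
    return [
        f'cpu_utilization{{instance="host-{i % 50}"}} {rng.random():.14f}\n'.encode()
        for i in range(count)
    ]


def compressed_size(rows, rows_per_block, level):
    """Compress rows in independent blocks and return the total size in bytes."""
    total = 0
    for start in range(0, len(rows), rows_per_block):
        block = b"".join(rows[start:start + rows_per_block])
        total += len(zlib.compress(block, level))
    return total


rows = make_rows(10_000)
small_blocks = compressed_size(rows, rows_per_block=100, level=1)
large_blocks = compressed_size(rows, rows_per_block=10_000, level=1)
high_level = compressed_size(rows, rows_per_block=10_000, level=9)

print("small blocks, low level:", small_blocks)
print("large blocks, low level:", large_blocks)
print("large blocks, high level:", high_level)
```

Larger blocks give the compressor more shared context and less per-block overhead, at the cost of samples waiting longer before they are sent.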

Keeping only significant figures
instance:cpu_utilization:ratio_avg{instance="foo"} 0.05055757575781
instance:cpu_utilization:ratio_avg{instance="bar"} 0.05058181818236
rules:
  - record: instance:cpu_utilization:ratio_avg
    expr: avg_over_time(instance:node_cpu_utilization:ratio[5m])

Keeping only significant figures
Applying --vm-significant-figures=8 to recording rules
0.05055757575781
0.050557576
Compressed size dropped from 1.2 bytes to 0.8 bytes per sample
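
Rounding to a fixed number of significant figures can be sketched as follows. This is an illustration of the idea, not the VictoriaMetrics implementation, and the function name is hypothetical.

```python
def round_to_significant_figures(value: float, figures: int) -> float:
    """Round value to the given number of significant decimal figures."""
    if value == 0:
        return 0.0
    # Scientific notation keeps one digit before the decimal point,
    # so figures - 1 digits remain after it.
    return float(f"{value:.{figures - 1}e}")


print(round_to_significant_figures(0.05055757575781, 8))
print(round_to_significant_figures(0.05058181818236, 8))
```

The trailing digits dropped here carry little information for a CPU utilization ratio, which is why removing them costs little accuracy.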
See more at https://medium.com/victoriametrics-how-to-migrate-data-from-prometheus

Understanding the data - query tracing
VictoriaMetrics supports query tracing for detecting bottlenecks during query processing.
This is like EXPLAIN ANALYZE from PostgreSQL!
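
Tracing is requested per query by adding trace=1 to the query arguments. The sketch below only builds such a request URL. The base URL, the query, and the timestamps are placeholders, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def traced_query_url(base_url, query, start, end, step):
    """Build a range-query URL with tracing enabled via trace=1."""
    params = {
        "query": query,
        "start": start,
        "end": end,
        "step": step,
        "trace": "1",
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# Placeholder values: a local single-node instance and an example query.
url = traced_query_url(
    "http://localhost:8428",
    "sum(rate(http_requests_total[5m]))",
    start=1700000000,
    end=1700003600,
    step="1m",
)
print(url)
```

The trace is expected to come back alongside the query result in the JSON response, so the same request works from a browser, curl, or a script.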

https://play.victoriametrics.com
Query tracing demo!

If query tracing demo didn't work…
Typical query takes 4s to execute… Why?

Let's check the trace!

91% of the time was spent in vmselect, aggregating
9.4k series with 13 million samples!

How to improve query speed?
1. Add more resources to monitoring.
2. Or… be smarter about data!

Cardinality explorer demo!

If cardinality explorer demo didn't work…

Cardinality explorer: summary
VictoriaMetrics allows exploring time series cardinality to identify:
● Metric names with the highest number of series
● Labels with the highest number of series
● Values with the highest number of series for the selected label
● label=value pairs with the highest number of series
● Labels with the highest number of unique values
➔ Built into VictoriaMetrics components
➔ Can also be pointed at a Prometheus URL
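
What these statistics mean can be shown on a handful of series. The sketch below computes them locally over a made-up list of label sets. It illustrates the definitions only and does not call any VictoriaMetrics API.

```python
from collections import Counter, defaultdict

# Hypothetical series, each represented by its label set.
series = [
    {"__name__": "http_requests_total", "instance": "a", "path": "/"},
    {"__name__": "http_requests_total", "instance": "a", "path": "/login"},
    {"__name__": "http_requests_total", "instance": "b", "path": "/"},
    {"__name__": "up", "instance": "a"},
    {"__name__": "up", "instance": "b"},
]

# Metric names with the highest number of series.
series_by_metric = Counter(s["__name__"] for s in series)

# Labels and label=value pairs with the highest number of series.
series_by_label = Counter(k for s in series for k in s if k != "__name__")
series_by_pair = Counter(
    (k, v) for s in series for k, v in s.items() if k != "__name__"
)

# Labels with the highest number of unique values.
unique_values = defaultdict(set)
for s in series:
    for k, v in s.items():
        if k != "__name__":
            unique_values[k].add(v)
unique_value_counts = {k: len(v) for k, v in unique_values.items()}

print(series_by_metric.most_common(1))
print(series_by_label.most_common())
print(series_by_pair.most_common(2))
print(unique_value_counts)
```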

Streaming aggregation vs Recording rules
The number of time series stored in TSDB
is Data-in + Recording Rules results

Streaming aggregation vs Recording rules
The number of time series stored in TSDB
is only what needs to be persisted

How to use streaming aggregation
- match: "grpc_server_handled_total"  # time series selector
  interval: "2m"                      # on 2m interval
  outputs: ["total"]                  # aggregate as counter
  without: ["grpc_method"]            # group without label
Result:
grpc_server_handled_total:2m_without_grpc_method_total
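
The effect of this rule can be sketched in a few lines. This is a simplified model under stated assumptions, not the actual aggregation code: it takes per-series counter increases for one interval as given, and the sample label values are made up. The naming helper just reproduces the pattern visible in the result above.

```python
from collections import defaultdict


def output_name(metric, interval, without, output):
    """Reproduce the output naming pattern shown in the result above."""
    return f"{metric}:{interval}_without_{'_'.join(without)}_{output}"


def aggregate_total(increases, without):
    """Sum per-series increases after dropping the `without` labels."""
    totals = defaultdict(float)
    for labels, increase in increases:
        key = tuple(sorted(
            (k, v) for k, v in labels.items() if k not in without
        ))
        totals[key] += increase
    return dict(totals)


# Hypothetical counter increases observed during one 2m interval.
increases = [
    ({"grpc_method": "Get", "grpc_code": "OK"}, 10),
    ({"grpc_method": "Put", "grpc_code": "OK"}, 5),
    ({"grpc_method": "Get", "grpc_code": "Error"}, 1),
]

name = output_name("grpc_server_handled_total", "2m", ["grpc_method"], "total")
aggregated = aggregate_total(increases, without=["grpc_method"])

print(name)
print(aggregated)
```

Three input series collapse into two output series here, one per remaining grpc_code value, which is where the storage saving comes from.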

Streaming aggregation: summary
1. Aggregate incoming samples in streaming mode before data is written to remote
storage
2. Aggregation is applied to all the metrics received via any supported data
ingestion protocol and/or scraped from Prometheus-compatible targets
3. Statsd alternative
4. Recording rules alternative
5. Reducing the number of stored samples
6. Reducing the number of stored series
7. Compatible with tools supporting Prometheus remote write protocol

Complexity penalty
● Complex systems are harder to maintain
● Complex systems are harder to teach and to learn
● Complex systems are more expensive to scale

Additional materials
1. Snapshot of Grafana dashboard from the benchmark
2. Benchmark repo for reproducing the test
3. Save network costs with VictoriaMetrics remote write protocol
4. VictoriaMetrics: achieving better compression than Gorilla for time series data
5. Streaming aggregation
6. VictoriaMetrics playground

Questions?
● https://github.com/VictoriaMetrics
● https://github.com/hagen1778
