Large-scale logging made easy
Aliaksandr Valialkin, CTO at VictoriaMetrics
Open Source Monitoring Conference 2023
Logging? What is it?
A log line typically consists of:
● a timestamp
● a level
● a location
● a message
The purpose of logging: debugging
● Which errors have occurred in the app during the last hour?
● Why did the app return an unexpected response?
● Why wasn’t the app working correctly yesterday?
● What was the app doing during a particular time range?
The purpose of logging: security
● Who dropped the database in production?
● Which IP addresses were used for logging in as admin during the last hour?
● Who performed a particular action at the given time?
● How many failed login attempts were there during the last day?
The purpose of logging: stats and metrics
● How many requests were served per hour during the last day?
● How many unique users accessed the app during the last month?
● How many requests were served for a particular IP range yesterday?
● What percentage of requests finished with errors during the last hour?
● What was the 95th percentile of request duration for the given web page yesterday?
Traditional logging
● Save logs to files on the local filesystem
● Use command-line tools for log analysis: cat, grep, awk, sort, uniq, head, tail, etc.
Traditional logging: advantages
● Easy to set up and operate
● Easy to debug
● Easy to analyze logs with command-line tools and bash scripts
● Has worked perfectly for 50 years (since the 1970s)
Traditional logging: disadvantages
● Hard to analyze logs from hundreds of hosts (hello, Kubernetes and microservices)
● Slow search over large log files (e.g. scanning a 1TB log file may take an hour)
● Imperfect support for structured logging (e.g. logs with arbitrary labels)
The solution: large-scale logging
Large-scale logging: core principles
● Push logs from a large number of apps to a centralized system
● Provide fast queries over all the ingested logs
● Support structured logging
Large-scale logging: solutions
● Cloud (DataDog, Sumo Logic, New Relic, etc.)
● On-prem (Elasticsearch, OpenSearch, Grafana Loki, VictoriaLogs, etc.)
Large-scale logging: cloud vs on-prem
Large-scale logging: operational complexity
● Cloud: easy - the cloud provider operates the system
● On-prem: harder - you need to set up and operate the system yourself
Large-scale logging: security
● Cloud: questionable - who has access to your logs?
● On-prem: good - your logs are under your control
Large-scale logging: price
● Cloud: very expensive (millions of €)
● On-prem: depends on the cost efficiency of the chosen system
Large-scale logging: on-prem comparison
Large-scale logging: on-prem: setup and operation
● Elasticsearch: hard because of non-trivial indexing configs for logs
● Grafana Loki: hard because of its microservice architecture and complex configs
● VictoriaLogs: easy because it runs out of the box from a single binary with default configs
Large-scale logging: on-prem: costs
● Elasticsearch: high - it needs a lot of RAM and disk space
● Grafana Loki: medium - it needs a lot of RAM for high-cardinality labels
● VictoriaLogs: low - a single VictoriaLogs instance can replace a 30-node Elasticsearch or Loki cluster
Large-scale logging: on-prem: full-text search support
● Elasticsearch: yes, but it needs proper index configuration
● Grafana Loki: yes, but very slow
● VictoriaLogs: yes, and it works out of the box for all the ingested log fields and labels without additional configs
Large-scale logging: on-prem: how to efficiently query 100TB of logs?
● Elasticsearch: run a cluster with 200TB of disk space and 6TB of RAM. Infrastructure costs at GCE or AWS: ~€50K/month
● Grafana Loki: impossible, because the query would take hours to execute
● VictoriaLogs: run a single node with 6TB of disk space and 200GB of RAM. Infrastructure costs at GCE or AWS: ~€2K/month
Large-scale logging: on-prem: integration with CLI tools
● Elasticsearch: poor
● Grafana Loki: poor
● VictoriaLogs: excellent
VictoriaLogs for large-scale logging
● Satisfies the requirements for large-scale logging
○ Efficiently stores logs from a large number of distributed apps
○ Provides fast full-text search
○ Supports both structured and unstructured logs
● Provides traditional logging features
○ Ease of use
○ Great integration with CLI tools - grep, awk, head, tail, less, etc.
VictoriaLogs: CLI integration (with demo)
Which errors have occurred in all the apps during the last hour?

LogsQL query: _time:1h error
● _time:1h - filter on the log timestamp: select logs for the last hour
● error - word filter: select all logs containing the word “error”

The query is sent via a simple bash wrapper around curl, and the response is processed with plain old CLI tools connected via Unix pipes. The result can be saved to a file at any stage with “… > response_file” for later analysis.

The response is JSON lines. Each line contains the log message (_msg), the log stream (_stream, aka the app instance) and the log timestamp (_time). Other log fields can be requested if needed.

DEMO
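A minimal sketch of such a bash wrapper, assuming a VictoriaLogs instance on localhost listening on the default port 9428 and its documented /select/logsql/query HTTP endpoint; the logsql.sh name is purely illustrative:

#!/bin/sh
# logsql.sh - illustrative wrapper: send the LogsQL query given as the first
# argument to a local VictoriaLogs instance and print the matching logs
# as JSON lines on stdout.
curl -s http://localhost:9428/select/logsql/query --data-urlencode "query=$1"

Example usage: ./logsql.sh '_time:1h error' | less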
Show only log messages
jq -r ._msg
DEMO
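Assuming the illustrative logsql.sh wrapper from the previous example, the whole pipeline could look like this:

# Print only the log messages, dropping all other JSON fields.
./logsql.sh '_time:1h error' | jq -r ._msg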
How many errors have occurred during the last hour?
Plain old “wc -l” counts the number of logs containing the “error” word.
DEMO
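A sketch with the illustrative logsql.sh wrapper: the word filter does the selection server-side, and wc -l simply counts the returned JSON lines.

# Count the logs containing the word "error" over the last hour.
./logsql.sh '_time:1h error' | wc -l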
Which apps generated the most errors during the last hour?
Traditional bash-fu (see the sketch below):
● Get the _stream field from every JSON line
● Sort the _stream values
● Count the number of unique _stream values
● Sort the counts of unique _stream values in reverse order
● Return the top 8 _stream values with the highest counts
The output contains the _stream values together with their counts.
DEMO
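A sketch of this bash-fu, again assuming the illustrative logsql.sh wrapper and that the _stream field is returned as a plain string in every JSON line:

# Top 8 log streams (app instances) by the number of error logs in the last hour:
#   jq -r ._stream  - get the _stream field from every JSON line
#   sort | uniq -c  - count identical _stream values
#   sort -rn        - order by count, highest first
#   head -n 8       - keep the top 8 streams
./logsql.sh '_time:1h error' | jq -r ._stream | sort | uniq -c | sort -rn | head -n 8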
Fluentbit-gke errors during the last hour
_stream filter: select logs with kubernetes_container_name="fluentbit-gke"
DEMO
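A sketch assuming the illustrative logsql.sh wrapper and the LogsQL stream filter syntax _stream:{...}:

# Error logs from the fluentbit-gke container during the last hour.
./logsql.sh '_time:1h error _stream:{kubernetes_container_name="fluentbit-gke"}' | jq -r ._msg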
The number of per-minute errors for the last 10 minutes
● Select the _time field from the JSON lines
● Trim the _time values to minutes
● Sort the _time values
● Count unique _time values
The output contains _time values trimmed to the minute, together with the number of logs for each minute.
DEMO
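A sketch assuming the illustrative logsql.sh wrapper and RFC3339 _time values, so that the first 16 characters are the timestamp trimmed to the minute:

# Per-minute error counts for the last 10 minutes.
./logsql.sh '_time:10m error' | jq -r ._time | cut -c1-16 | sort | uniq -c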
Non-200 status codes for the last week
Find logs containing the “status=” phrase, but not the “status=200” phrase.
DEMO
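A sketch assuming the illustrative logsql.sh wrapper; the quoted phrase filters and the NOT operator are LogsQL constructs:

# Logs containing "status=" but not "status=200" over the last week.
./logsql.sh '_time:1w "status=" NOT "status=200"' | jq -r ._msg | less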
Top client IPs for the last 4 weeks with 400 or 404 response status codes
● Find logs containing the “remote_addr=” phrase and either the “status=404” or the “status=400” phrase
● Extract the IP address from remote_addr=...
● Drop the “remote_addr=” prefix
DEMO
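A sketch assuming the illustrative logsql.sh wrapper and access-log-style messages where the client IP appears as remote_addr=1.2.3.4; OR and parentheses are LogsQL operators:

# Top client IPs with 400 or 404 responses during the last 4 weeks:
#   grep -oE extracts remote_addr=<ip>, sed drops the "remote_addr=" prefix,
#   sort | uniq -c | sort -rn | head returns the most frequent IPs first.
./logsql.sh '_time:4w "remote_addr=" ("status=400" OR "status=404")' |
  jq -r ._msg | grep -oE 'remote_addr=[0-9.]+' | sed 's/remote_addr=//' |
  sort | uniq -c | sort -rn | head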
Per-day stats for the given IP during the last 10 days
● Search for log messages with the given IP
● A bit of bash-fu: extract the log timestamp, cut it to days and calculate the number of entries per day
DEMO
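A sketch assuming the illustrative logsql.sh wrapper and RFC3339 _time values (the first 10 characters are the date); 1.2.3.4 is a placeholder for the IP of interest:

# Per-day log counts for the given IP over the last 10 days.
./logsql.sh '_time:10d "1.2.3.4"' | jq -r ._time | cut -c1-10 | sort | uniq -c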
Per-level stats for the last 5 days, excluding info logs
Select logs where the “level” field isn’t equal to “info”, “INFO” or an empty string.
DEMO
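A sketch assuming the illustrative logsql.sh wrapper, LogsQL field filters of the form level:value, the empty-value filter level:"", and that the level field is present in the returned JSON lines:

# Per-level log counts for the last 5 days, excluding info-level logs.
./logsql.sh '_time:5d NOT level:info NOT level:INFO NOT level:""' |
  jq -r .level | sort | uniq -c | sort -rn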
A system for large-scale logging MUST provide excellent CLI integration
Large-scale logging
Don’t like CLI and bash? Then use the web UI!
VictoriaLogs: (temporary) drawbacks
● Missing data extraction and advanced stats functionality in LogsQL (but it can be replaced with traditional CLI tools, as shown above)
● Missing cluster version (but a single-node VictoriaLogs can replace a 30-node Elasticsearch or Loki cluster)
● Missing integration with Grafana (but it has its own web UI, which is going to be better than Grafana for logs)
VictoriaLogs: recap
● Easy to set up and operate
● The lowest RAM and disk space usage (up to 30x less than Elasticsearch and Grafana Loki)
● Fast full-text search
● Excellent integration with traditional command-line tools for log analysis
● Accepts logs from all the popular log shippers (Filebeat, Fluentbit, Logstash, Vector, Promtail)
● Open source and free to use!
VictoriaLogs: useful links
● General docs - https://docs.victoriametrics.com/VictoriaLogs/
● LogsQL docs - https://docs.victoriametrics.com/VictoriaLogs/LogsQL.html
Questions?
