A practical introduction to observability

Nikolay Stoitsev
Engineering Manager @ Halo DX

Monitoring
Logging
Distributed Tracing

Monitoring system components
[Diagram: Application (×3), Monitoring System, Time Series Database, Dashboard]

Time series database examples: Prometheus, Graphite, m3db

Dashboard examples: Prometheus UI, Grafana

Cardinality
● search.success, app_version=1, type=Patient
● search.success, app_version=1, type=Exam
● search.success, app_version=2, type=Patient
● search.success, app_version=2, type=Exam

#1. Don’t add high-cardinality tags
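
A minimal sketch of why tag cardinality matters, assuming each distinct combination of tag values is stored as its own time series. The `user_id` tag and its 10,000 values are hypothetical, added only to illustrate a high-cardinality tag.

```python
def series_count(tag_values: dict[str, list[str]]) -> int:
    """Number of distinct time series one metric name can produce."""
    count = 1
    for values in tag_values.values():
        count *= len(values)
    return count


# Tags from the search.success example: 2 app versions x 2 types.
low = {"app_version": ["1", "2"], "type": ["Patient", "Exam"]}

# Hypothetical high-cardinality tag: one value per user.
high = dict(low, user_id=[str(i) for i in range(10_000)])

print(series_count(low))   # 4
print(series_count(high))  # 40000
```

Adding one per-user tag multiplies the series count by the number of users, which is what makes such tags expensive.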

Metrics are not accurate
● The DB engine trades precision for faster operations
● Values are approximated when queried at a different time resolution
● Values are approximated when archived for long-term storage
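
A small illustration of how a change of time resolution can lose information, assuming the storage rolls up raw points by averaging (real systems may use other aggregations). The sample values are made up.

```python
from statistics import mean


def downsample(points: list[float], factor: int) -> list[float]:
    """Average every `factor` consecutive points into a single point."""
    return [mean(points[i:i + factor]) for i in range(0, len(points), factor)]


# Hypothetical latency samples in ms, with one spike.
raw = [10, 10, 10, 10, 10, 500, 10, 10, 10, 10, 10, 10]
rolled = downsample(raw, 6)

print(max(raw))     # the spike is visible in the raw data
print(max(rolled))  # the spike is flattened after the rollup
```

After the rollup the 500 ms spike shows up only as a mildly elevated average, so exact counts or peaks should not be read from archived metrics.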

#2. Don’t rely on metrics infrastructure for BI

Don’t use average values
● Averages hide the outliers
● They don’t represent typical behavior

Use percentiles
● p90 represents the worst experience in 90% of cases
● Can measure p90, p95, p99
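
A sketch comparing the average with percentiles on made-up latency data. The `percentile` helper uses the nearest-rank definition, which is one of several common definitions; a metrics system may compute percentiles differently.

```python
from statistics import mean


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 95 fast requests and 5 very slow ones (ms).
latencies_ms = [100] * 95 + [2000] * 5

print(mean(latencies_ms))            # 195
print(percentile(latencies_ms, 90))  # 100
print(percentile(latencies_ms, 99))  # 2000
```

The average of 195 ms matches no real request, while p90 and p99 show both the typical case and the slow tail.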

Histograms
● Show the whole distribution
● Configurable buckets
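
A minimal sketch of a histogram with configurable buckets. The bucket bounds and sample values are assumptions for illustration, and the counts here are per bucket rather than cumulative.

```python
from bisect import bisect_left


def histogram(values: list[float], upper_bounds: list[float]) -> list[int]:
    """Count values per bucket.

    Bucket i holds values <= upper_bounds[i] (and above the previous bound);
    the final extra bucket holds everything above the last bound.
    `upper_bounds` must be sorted in ascending order.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        counts[bisect_left(upper_bounds, v)] += 1
    return counts


bounds_ms = [100, 250, 500, 1000]                    # hypothetical buckets
samples_ms = [40, 90, 100, 120, 300, 450, 700, 2500]  # hypothetical latencies

print(histogram(samples_ms, bounds_ms))  # [3, 1, 2, 1, 1]
```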

#3. Use percentiles or histograms

Alert Levels
● Send Slack/Teams message
● Send alert to on-call

Alerting tool is usually built into the metrics system

Alerts should be
● urgent
● important
● actionable
● real

Alerts should represent either ongoing or imminent problems

#1. Better to remove an alert when it’s noisy

Symptom-based monitoring
● Number of 5xx HTTP response codes
● Response time
● Email sending is not working
● Users can’t log in

Cause-based monitoring
● Free disk space on database server
● Memory utilisation
● Free file descriptors

Many causes may trigger a symptom

#3. Focus on symptom-based alerts
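
A sketch of a symptom-based check on the share of 5xx responses, mapped to the two alert levels mentioned earlier. The function name and both thresholds are hypothetical; in practice this rule would normally live in the alerting tool of the metrics system rather than in application code.

```python
from enum import Enum


class AlertLevel(Enum):
    NONE = "no alert"
    MESSAGE = "send Slack/Teams message"
    PAGE = "send alert to on-call"


def classify_5xx(total_requests: int, errors_5xx: int,
                 message_ratio: float = 0.01,
                 page_ratio: float = 0.05) -> AlertLevel:
    """Pick an alert level from the 5xx ratio (thresholds are assumptions)."""
    if total_requests == 0:
        return AlertLevel.NONE
    ratio = errors_5xx / total_requests
    if ratio >= page_ratio:
        return AlertLevel.PAGE
    if ratio >= message_ratio:
        return AlertLevel.MESSAGE
    return AlertLevel.NONE


print(classify_5xx(1000, 2).value)   # no alert
print(classify_5xx(1000, 20).value)  # send Slack/Teams message
print(classify_5xx(1000, 80).value)  # send alert to on-call
```

The check looks only at what users experience, regardless of which underlying cause produced the errors.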

Cause-based alerts are also necessary

Picking alerts to start with
[Diagram: Front-end, Load Balancer, Back-end, DB]
● Count rate of successful log-ins
● Count request success rate

Logging system
[Diagram: Application (×3), Log Collector (×3), Log Aggregation Database, Dashboard]
Log collector examples: Logstash, Fluentd

Log aggregation database examples: Elasticsearch, Loki

Dashboard example: Kibana

Finding logs
Can search by:
● content of log message
message : *notification*
● all logs from a service
kubernetes.labels.app/name.keyword : "api-gateway"
● many more fields, thanks to the flexible query schema

#1. Use the appropriate log level: info, warn, error
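
A short sketch of log levels using Python's standard `logging` module. The logger name, the messages, and the mapping of events to levels follow a common convention and are assumptions, not something the slides prescribe. `ListHandler` is a hypothetical helper that keeps records in memory so the effect of the level threshold is visible.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that collects (level, message) pairs in memory."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[tuple[str, str]] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append((record.levelname, record.getMessage()))


logger = logging.getLogger("notifications")  # hypothetical service name
logger.setLevel(logging.INFO)                # drop anything below info
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

logger.debug("payload built")             # below the threshold, not recorded
logger.info("notification sent")          # normal operation
logger.warning("retrying after timeout")  # recoverable problem
logger.error("notification failed")       # needs attention

print(handler.records)
```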

Structured logging
● Append useful key=value pairs
● Can group (aggregate) by the keys
● Can sort by aggregations
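
A minimal sketch of structured logging with key=value pairs, followed by a group-and-sort step similar to what a log dashboard does. `format_event` and `parse_fields` are hypothetical helpers, and the parsing assumes values contain no spaces.

```python
from collections import Counter


def format_event(message: str, **fields: object) -> str:
    """Append key=value pairs to a log message."""
    pairs = " ".join(f"{key}={value}" for key, value in fields.items())
    return f"{message} {pairs}".strip()


def parse_fields(line: str) -> dict[str, str]:
    """Extract key=value pairs from a log line (values without spaces)."""
    return dict(token.split("=", 1) for token in line.split() if "=" in token)


lines = [
    format_event("search failed", type="Patient", status=500),
    format_event("search failed", type="Exam", status=500),
    format_event("search failed", type="Patient", status=503),
]

# Group (aggregate) by the "type" key, then sort by count.
by_type = Counter(parse_fields(line)["type"] for line in lines)
print(lines[0])               # search failed type=Patient status=500
print(by_type.most_common())  # [('Patient', 2), ('Exam', 1)]
```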

Too many logs
[Diagram: Application (×3), Log Scraper (×3), Log Aggregation, Real Time Search Engine, Dashboard]

● Reduce log retention period

● Add Cold Storage and a Query UI

#3. Use a proper retention period or cold storage
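
A sketch of a retention policy that keeps recent logs searchable, moves older logs to cold storage, and drops the rest. The tier names and the 7-day and 90-day limits are assumptions for illustration; real log stores usually express this as index or bucket lifecycle configuration.

```python
from datetime import datetime, timedelta, timezone


def storage_tier(log_time: datetime, now: datetime,
                 hot_days: int = 7, cold_days: int = 90) -> str:
    """Decide where a log entry belongs based on its age."""
    age = now - log_time
    if age <= timedelta(days=hot_days):
        return "hot"      # kept in the real-time search engine
    if age <= timedelta(days=cold_days):
        return "cold"     # moved to cheaper cold storage
    return "expired"      # past retention, can be deleted


now = datetime(2024, 1, 31, tzinfo=timezone.utc)  # fixed time for the example
print(storage_tier(now - timedelta(days=2), now))    # hot
print(storage_tier(now - timedelta(days=30), now))   # cold
print(storage_tier(now - timedelta(days=200), now))  # expired
```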

Distributed tracing
https://www.youtube.com/watch?v=rM1z7Q1TxR0

End-to-end summary
1. Configure automated alerts
2. Use metrics and tracing to pinpoint the problem
3. Use structured logging to find the root cause of the problem easily
4. Fix problems and make sure all metrics are back to normal

Thank you! Q&A
Nikolay Stoitsev
Engineering Manager at Halo DX
Photo by Pixabay, Şahin Sezer Dinçer, Andrea Piacquadio, Ian Beckley from Pexels
