A practical introduction
to observability
Nikolay Stoitsev
Engineering Manager @ Halo DX
Monitoring
Logging
Distributed Tracing
Monitoring
Monitoring system components
● Application (×3)
● Monitoring System and Time Series Database, e.g. Prometheus, Graphite, m3db
● Dashboard, e.g. Prometheus UI, Grafana
Counter
Counter increase
Timer
Labels
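To make the three building blocks concrete, here is a minimal sketch of a counter, a timer and labels. `MetricsRegistry` is a hypothetical in-memory class written for illustration, not the API of Prometheus or any other library, and the metric name `search.duration` is likewise assumed.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class MetricsRegistry:
    """Hypothetical in-memory registry; not the API of any real metrics library."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    @staticmethod
    def _key(name, labels):
        # A time series is identified by its name plus its sorted label pairs.
        return (name, tuple(sorted(labels.items())))

    def increment(self, name, amount=1, **labels):
        # Counters only go up; a dashboard plots their increase over time.
        self.counters[self._key(name, labels)] += amount

    @contextmanager
    def timer(self, name, **labels):
        # Record how long the wrapped block took, in seconds.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.timings[self._key(name, labels)].append(elapsed)


registry = MetricsRegistry()

with registry.timer("search.duration", type="Patient"):
    registry.increment("search.success", app_version="1", type="Patient")
registry.increment("search.success", app_version="1", type="Patient")
registry.increment("search.success", app_version="2", type="Exam")

for (name, labels), value in sorted(registry.counters.items()):
    print(name, dict(labels), value)
```

Each distinct combination of label values becomes its own series, which is why the two `Patient` increments land on one counter and the `Exam` increment on another.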
What to watch out for?
Cardinality
Cardinality
● search.success, app_version=1, type=Patient
● search.success, app_version=1, type=Exam
● search.success, app_version=2, type=Patient
● search.success, app_version=2, type=Exam
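The four series above come from 2 app versions × 2 types. A rough sketch of how the count grows when a label with many values is added; the `user_id` label and its 10,000 values are assumptions chosen only to illustrate the effect.

```python
from math import prod


def series_count(label_values):
    # Worst case: one time series per combination of label values.
    return prod(len(values) for values in label_values.values())


labels = {"app_version": ["1", "2"], "type": ["Patient", "Exam"]}
print(series_count(labels))  # 4

# Hypothetical high-cardinality label: one value per user.
labels_with_user = dict(labels, user_id=[str(i) for i in range(10_000)])
print(series_count(labels_with_user))  # 40000
```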
#1. Don’t add high-cardinality tags
Metrics are not always exact
● The DB engine trades precision for faster operations
● Values are approximated when queried at a different time resolution
● Data is downsampled when metrics are archived for long-term storage
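One way precision gets lost is downsampling. The sketch below assumes the store rolls up raw samples by averaging fixed-size windows; real systems offer several rollup strategies, so treat this as an illustration rather than how any particular database behaves.

```python
from statistics import mean


def downsample(samples, factor):
    # Replace every `factor` raw samples with their average.
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]


raw = [10, 10, 10, 10, 10, 500, 10, 10, 10, 10, 10, 10]
coarse = downsample(raw, 6)

print(max(raw))     # the spike is visible at full resolution
print(max(coarse))  # after the rollup the spike is heavily flattened
```

An exact count, such as the number of orders for a report, cannot be recovered once the raw samples are gone.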
#2. Don’t rely on metrics
infrastructure for BI
Don’t use average values
● Averages hide the
outliers
● Averages don’t represent
typical behavior
Use percentiles
● p90 represents the
worst experience in
90% of cases
● Can measure p90,
p95, p99
p90
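A small sketch of why the average misleads. The latencies are made-up sample data, and `percentile` is a simple nearest-rank implementation written for this example; metrics systems typically estimate percentiles differently.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 90 fast requests and 10 very slow ones, in milliseconds.
latencies_ms = [100] * 90 + [5000] * 10

print(mean(latencies_ms))            # 590
print(percentile(latencies_ms, 90))  # 100
print(percentile(latencies_ms, 99))  # 5000
```

No request actually took 590 ms: the average describes neither the typical request nor the slow ones, while p90 and p99 show both.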
Histograms
● Shows the whole
distribution
● Configurable
buckets
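A minimal histogram sketch with configurable buckets. The `Histogram` class and the bucket bounds (in seconds) are assumptions for illustration; each observation is counted in the first bucket whose upper bound is at least the observed value.

```python
from bisect import bisect_left


class Histogram:
    """Hypothetical histogram; not the API of any real metrics library."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot for values above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect_left(self.upper_bounds, value)] += 1


histogram = Histogram([0.1, 0.5, 1.0])
for seconds in [0.05, 0.07, 0.3, 0.5, 0.9, 4.2]:
    histogram.observe(seconds)

labels = ["<= 0.1", "<= 0.5", "<= 1.0", "> 1.0"]
for label, count in zip(labels, histogram.counts):
    print(label, count)
```

Bucket bounds are a trade-off: more buckets give a finer picture of the distribution but create more series to store.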
#3. Use percentiles or
histograms
Example alert
Alert Levels
● Send Slack/Teams message
● Send alert to on-call
Alerting tool is usually built
into the metrics system
Alerts should be
● urgent
● important
● actionable
● real
Should represent either
ongoing or imminent
problems
What to watch out for?
#1. Better to remove an alert
when it’s noisy
#2. Use success rate
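A sketch of alerting on success rate instead of a raw error count, combined with the two alert levels. The thresholds (99% and 95%) and the choice to treat zero traffic as healthy are assumptions for illustration; pick values that match your own service.

```python
def success_rate(successes, total):
    # Assumption: with no traffic there is nothing failing, so report 100%.
    return 1.0 if total == 0 else successes / total


def alert_action(rate, warn_below=0.99, page_below=0.95):
    # Hypothetical thresholds: a mild drop goes to chat, a severe drop pages on-call.
    if rate < page_below:
        return "send alert to on-call"
    if rate < warn_below:
        return "send Slack/Teams message"
    return None


for successes, total in [(999, 1000), (980, 1000), (900, 1000)]:
    rate = success_rate(successes, total)
    print(f"{rate:.1%}", alert_action(rate))
```

A rate stays meaningful as traffic grows or shrinks, whereas a fixed error count becomes too noisy at peak and too quiet at night.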
Symptom-based monitoring
● Number of 5xx HTTP response codes
● Response time
● Email sending is not working
● Users can’t log in
Cause-based monitoring
● Free disk space on database server
● Memory utilisation
● Free file descriptors
Many causes may trigger a
symptom
User impact is most
important
#3. Focus on
symptom-based alerts
Cause-based alerts are
also necessary
Picking alerts to start with
Front-end, Load Balancer, Back-end, DB
● Track the rate of successful log-ins
● Track the request success rate
Logging
Logging system
● Application (×3)
● Log Collector (×3), e.g. Logstash, Fluentd
● Log Aggregation Database, e.g. Elasticsearch, Loki
● Dashboard, e.g. Kibana
Log messages
Finding logs
Can search by:
● content of log message
message : *notification*
● all logs from a service
kubernetes.labels.app/name.keyword : "api-gateway"
● and many more, thanks to the flexible query schema
What to watch out for?
#1. Use the appropriate log
level: info, warn, error
Structured logging
● Append useful key=value pairs
● Can group (aggregate) by the keys
● Can sort by aggregations
#2. Use structured logging
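A minimal sketch of structured logging with Python's standard `logging` module, emitting one JSON object per line. The `JsonFormatter` class, the `fields` key passed through `extra`, and the field names are assumptions made for this example. The last lines mimic what a log dashboard does when it groups and sorts by a key.

```python
import io
import json
import logging
from collections import Counter


class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Start with the level and message, then append the key=value pairs.
        entry = {"level": record.levelname, "message": record.getMessage()}
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("structured-example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("search finished", extra={"fields": {"type": "Patient", "results": 3}})
logger.warning("search slow", extra={"fields": {"type": "Exam", "duration_ms": 1800}})
logger.error("search failed", extra={"fields": {"type": "Exam"}})

# Group (aggregate) by a key and sort by the aggregation.
entries = [json.loads(line) for line in stream.getvalue().splitlines()]
by_type = Counter(entry["type"] for entry in entries)
print(by_type.most_common())  # [('Exam', 2), ('Patient', 1)]
```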
Too many logs
● Application (×3)
● Log Scraper (×3)
● Log Aggregation
● Real Time Search Engine
● Dashboard
Options:
● Reduce log retention period
● Add Cold Storage with a Query UI
#3. Use proper retention
period or cold storage
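A rough sketch of a retention policy that combines both options: recent logs stay in the search engine, older ones move to cold storage, and the oldest are deleted. The tier names and the 14-day and 365-day limits are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention limits; tune them to cost and compliance needs.
HOT_RETENTION = timedelta(days=14)
COLD_RETENTION = timedelta(days=365)


def storage_tier(log_time, now):
    age = now - log_time
    if age <= HOT_RETENTION:
        return "hot"     # searchable in real time
    if age <= COLD_RETENTION:
        return "cold"    # cheaper storage, slower queries
    return "delete"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
for days_old in (1, 30, 400):
    print(days_old, storage_tier(now - timedelta(days=days_old), now))
```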
Distributed tracing
https://www.youtube.com/watch?v=rM1z7Q1TxR0
End-to-end summary
1. Configure automated alerts
2. Use metrics and tracing to pinpoint the problem
3. Use structured logging to find the root cause of the problem easily
4. Fix problems and make sure all metrics are always back to normal
Thank you! Q&A
Nikolay Stoitsev
Engineering Manager at Halo DX
Photo by Pixabay, Şahin Sezer Dinçer, Andrea Piacquadio, Ian Beckley from Pexels
