Observability (o11y):
a gentle introduction
Bram Vogelaar
@attachmentgenie
We have all been there…
And that's when we show Bill…
Amongst other things…
$ whoami
• Used to be a Molecular Biologist
• Then became a Dev
• Now an Ops
• Currently Cloud Engineer @ The Factory
SNMP pulls hundreds of checks, logs live in databases
Push thousands of time series to Graphite + StatsD, Logstash for
logs
Prometheus scrapes millions of metrics, Logs are everywhere,
Traces are a thing now
How it started -> How it's going
Monitoring Symptoms
The monitoring everything age
The act of observing changes the system being observed
Monitoring itself adds a clearly visible baseline CPU load
aka HDDs and CPUs killed by monitoring love
Aim becomes 100% uptime
Monitoring solutions were not catching up
While Applications were deployed 10x a day, we were doing
manual configuration of Single Source of Truth
Eventual consistency with huge latency
Causing Chaos and Distrust in the Observer
Pager overload
“Observability is defined as a measure of how
well internal states of a system can be inferred
from knowledge of its external outputs.”
Engineering Definition
R.E.D & U.S.E
U = Utilization, as canonically defined
S = Saturation, measured as concurrency
E = Error Rate, as a throughput metric
R = Request Throughput, in requests per second
E = Request Error Rate, as either a throughput metric or a
fraction of overall throughput
D = Latency, Residence Time, or Response Time; all three are
widely used
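The R.E.D. definitions above can be made concrete with a small calculation. This is only a minimal sketch, not a prescribed implementation: the `Request` record, the `red_summary` helper, the 10-second window and the choice of the 95th percentile for duration are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # response time in seconds
    ok: bool           # False when the request failed


def red_summary(requests, window_s):
    """Summarize one observation window as Rate, Errors and Duration."""
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank 95th percentile, using integer ceiling to avoid float rounding
    rank = (95 * total + 99) // 100
    p95 = durations[rank - 1] if total else 0.0
    return {
        "rate_rps": total / window_s,                         # R: requests per second
        "errors_rps": errors / window_s,                      # E: as a throughput metric
        "error_fraction": errors / total if total else 0.0,   # E: as a fraction of throughput
        "p95_duration_s": p95,                                # D: response time
    }


# Hypothetical sample: 20 requests seen in a 10-second window, one of them failed
sample = [Request(0.1, True)] * 18 + [Request(0.5, False), Request(2.0, True)]
summary = red_summary(sample, window_s=10)
print(summary)
```

Reporting the error rate both ways mirrors the slide: the per-second figure shows absolute impact, while the fraction stays comparable when traffic changes.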
Logs are insurance
Large complex systems will always be in a
(somewhat) degraded state
3 Pillars Definition
Get engineers to be comfortable facing failure
If it hurts do it more often…
Don’t stare at dashboards
monitoring – luck == Observability
Cultural Definition
SL{O,I,A}
Service-Level Agreement (SLA) => Contract
Service-Level Objective (SLO) => Promise
Service-Level Indicator (SLI) => Observe
Only alert on SLO violations
And definitely on HDD space usage
Don’t alert on symptoms
Do alert on the rate at which you are burning the error budget
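Alerting on how fast the error budget is being used could look roughly like the sketch below. The function names, the two-window check and the 14.4 threshold are assumptions for illustration and are not taken from the talk; 14.4 is a commonly quoted paging threshold, but the right value depends on your own SLO period.

```python
def burn_rate(error_fraction, slo_target):
    """Return how fast the error budget is being spent.

    1.0 means errors arrive exactly at the rate the SLO allows;
    higher values mean the budget will run out before the SLO period ends.
    """
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be in [0, 1)")
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    return error_fraction / budget


def should_page(short_window_errors, long_window_errors, slo_target, threshold=14.4):
    """Page only when both a short and a long window exceed the threshold.

    The long window filters out brief blips; the short window lets the
    alert clear soon after the problem is fixed.
    """
    return (
        burn_rate(short_window_errors, slo_target) >= threshold
        and burn_rate(long_window_errors, slo_target) >= threshold
    )


# Hypothetical numbers: a 99% SLO with 20% of requests failing in both windows
print(should_page(0.2, 0.2, slo_target=0.99))    # sustained burn
print(should_page(0.2, 0.001, slo_target=0.99))  # short spike only
```

The inputs are SLI measurements (observed error fractions), and the SLO supplies the budget they are compared against, so the alert fires on the promise being at risk rather than on an individual symptom.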
What we need to visualize
$/€ revenue
# of sales
# of signups
# of API calls
“(Ab)Normal” application use
Instrumentation
OpenTelemetry => exciting times
monitoring – luck == Observability
monitoring + devops == Observability
Bram’s Definition
Contact
bram@attachmentgenie.com
@attachmentgenie
https://www.slideshare.net/attachmentgenie
Questions?
The Floor is yours…
