4 examples of real-world Kubernetes Observability Platforms that empower developers, and the principles behind them.
As a platform/ops/SRE team, what can you do to give developers real ownership of their Kubernetes alerts, and empower them to investigate and solve production issues on their own?
What does a good shared responsibility model look like?
Applying Platform Engineering Thinking to Observability.pdf
1. How to empower developers so they own observability - and is it a good idea?
2. About Me
Natan Yellin
Cofounder @ Robusta.dev
Maintainer of open source projects for developer enablement, around K8s observability (Robusta, KRR, Kubernetes ChatGPT Bot)
5. The Three Approaches to Ops
The IT Approach
We will do it for you
The "Sink or Swim" Approach
You build it, you run it
The Relationship Approach
Shared responsibility model
Assisted ownership
11. Gives Control
Example: Slack Gaz
"A visual interface to check your services, see where they are deployed, view their status..."
"The idea behind Debug Actions is to create a set of commands that are approved to be run in your pod and triggered from the interface."
Source: Slack Blog
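The "approved commands" idea in the second quote can be sketched in a few lines. This is only an illustration, not the implementation described in the source: the action names, the `APPROVED_ACTIONS` table, and the `build_debug_command` helper are all hypothetical, and the sketch only builds a `kubectl exec` argument list rather than running it.

```python
# Hypothetical allowlist: action name -> command approved to run inside a pod.
APPROVED_ACTIONS = {
    "disk-usage": ["df", "-h"],
    "processes": ["ps", "aux"],
}


def build_debug_command(action: str, namespace: str, pod: str) -> list[str]:
    """Return the kubectl argv for an approved action, or refuse."""
    if action not in APPROVED_ACTIONS:
        # Developers can only trigger commands the platform team has vetted.
        raise PermissionError(f"action {action!r} is not approved")
    return ["kubectl", "exec", "-n", namespace, pod, "--", *APPROVED_ACTIONS[action]]


if __name__ == "__main__":
    print(build_debug_command("disk-usage", "shop", "checkout-abc"))
```

The design choice here is that developers pick an action by name and never supply the command itself, so prod access stays limited to what has been approved.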
12. How does this fit into broader Platform Engineering initiatives?
13. Path to Production on Kubernetes
Artifacts & Registry
Create environments
Cost/Utilization Tuning
Containerize it
Packaging (Helm / Kustomize / Argo)
Set up alerts and dashboards
Troubleshooting issues
Create K8s Manifests
Write app
Security checkpoint
14. Recommended Ownership Grid
Alerting on Nodes and K8s Infra
Dev: ❌
Ops: Complete ownership
Defining Alerts (workloads)
Dev: Per-app customization
Ops: Default alerts
Required Enablement: Templates and examples so devs don't need to know PromQL
Investigating Alerts (workloads)
Dev: Actual investigation and per-app customization of dashboards
Ops: Default dashboards
Required Enablement: Prebuilt dashboards. Per-app dashboards
Right-sizing and cost-saving
Dev: Applying recommendations
Ops: Providing recommendations
Required Enablement: Actionable recommendations, not a dashboard you must "remember to look at"
Responding to Alerts (workloads)
Dev: Alerting infrastructure
Ops: SLA on responding to issues
Required Enablement: Team-based routing, giving devs safe prod access
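As one possible reading of "templates and examples so devs don't need to know PromQL", here is a minimal sketch of an alert template. It assumes the cluster exposes the kube-state-metrics metric `kube_pod_container_status_restarts_total`; the `RestartAlert` class, the `render_rule` helper, the alert name, and the default thresholds are hypothetical. The `team` label is included on the assumption that it is what team-based routing would key on.

```python
from dataclasses import dataclass


@dataclass
class RestartAlert:
    """Parameters a developer fills in; no PromQL knowledge needed."""
    app: str              # container name of the workload
    namespace: str
    team: str             # assumed label for team-based routing
    max_restarts: int = 3
    window: str = "15m"


def render_rule(alert: RestartAlert) -> dict:
    """Expand the parameters into a Prometheus-style alerting rule."""
    expr = (
        "increase(kube_pod_container_status_restarts_total"
        f'{{namespace="{alert.namespace}", container="{alert.app}"}}'
        f"[{alert.window}]) > {alert.max_restarts}"
    )
    return {
        "alert": "FrequentRestarts",
        "expr": expr,
        "labels": {"app": alert.app, "team": alert.team, "severity": "warning"},
        "annotations": {
            "summary": (
                f"{alert.app} restarted more than {alert.max_restarts} "
                f"times in {alert.window}"
            )
        },
    }


if __name__ == "__main__":
    import json

    rule = render_rule(RestartAlert(app="checkout", namespace="shop", team="payments"))
    print(json.dumps(rule, indent=2))
```

In this split, ops owns the template and its defaults, while the per-app customization is limited to the handful of fields on `RestartAlert`.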
16. How to reach me
LinkedIn - Natan Yellin
Twitter - aantn
Platform Engineering Slack - @Natan Yellin
Feel free to ask questions about the slides, or just say hi :)