Handling incidents collaboratively is like
solving a Rubik’s cube
More complex Communication
[Diagram: Kubernetes master (API server) with Worker 1 and Worker 2 running POD 1 and POD 2]
@nele_lea
@nlea@social.anoxion.de
@nlea
Resolve
Understanding Causality
Random fact
1. Understanding
2. Fixing
- Retry
- Restart
- Rolling back to an old version
Defining a Workflow
Prevent
Photo by Scott Sanker on Unsplash
Retry Strategy
Documentation
Best Practice
Discover
Make your System Observable
Telemetry Data
- Logs
- Metrics
- Traces
Observability
[Diagram: the instrumented application sends telemetry to an observability data backend, which feeds a query, alerting and visualization platform]
OpenTelemetry
- CNCF project
- Vendor neutral
- Merged OpenTracing and OpenCensus
- Specifications, protocols, APIs and SDKs
- No telemetry data backend!
But what to instrument?
The meaning of SLOs?
How do I write queries to understand
the metrics that I am collecting?
Questions raised by application developers
Auto Instrumentation?
Function Level Metrics
https://github.com/autometrics-dev
Autometrics
[Diagram: Autometrics instruments the application's functions and sends latency, error rate and request rate to a metrics backend, which feeds a query, alerting and visualization platform]
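To make "function-level metrics" concrete, here is a minimal sketch of the idea in Python. It is not the Autometrics API: it is a hand-rolled stand-in that uses only the standard library, and the names `function_metrics`, `CALLS`, `ERRORS`, `LATENCIES_S` and `error_ratio` are hypothetical. It assumes an in-process dictionary in place of a real metrics backend.

```python
import functools
import time
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Hypothetical in-process store. A real setup would export these values
# to a metrics backend instead of keeping them in dictionaries.
CALLS: Dict[str, int] = defaultdict(int)
ERRORS: Dict[str, int] = defaultdict(int)
LATENCIES_S: Dict[str, List[float]] = defaultdict(list)


def function_metrics(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record request count, error count and latency for one function."""
    name = func.__name__

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        CALLS[name] += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            ERRORS[name] += 1
            raise  # the caller still sees the original error
        finally:
            # Latency is recorded for successful and failed calls alike.
            LATENCIES_S[name].append(time.perf_counter() - start)

    return wrapper


def error_ratio(name: str) -> float:
    """Share of calls to `name` that raised an exception."""
    return ERRORS[name] / CALLS[name] if CALLS[name] else 0.0


@function_metrics
def divide(a: float, b: float) -> float:
    return a / b
```

Because the metric is keyed by function name, an alert can point straight at the piece of code that is failing, which gives application developers and SREs the same starting point.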
Demo
Kubetrain.io
berlin@kubetrain.io
🚂


Editor's Notes

  • #2 Handling incidents collaboratively is like solving a Rubik’s cube. Understanding the business outcome and the overall functionality of a system that consists of distributed services, plus the infrastructure components needed to run them at scale, is almost like solving a Rubik’s cube. Once an incident occurs, it is not enough to look at a single side of the cube. To solve the puzzle, all sides need to be considered. Monitoring a distributed system should not be the sole effort of a single engineering team. Observability should be a goal for all engineering teams. Nevertheless, it is often a mantra just for SRE teams. Coming from the perspective of an application engineer, I will outline how application engineers benefit from understanding infrastructure and common incidents, and how SRE teams benefit from understanding common failures in the application code. Let’s take a deeper look at what collaboration across different engineering teams means and how it supports the process of solving the Rubik’s cube together.
  • #7 The side of the application developers (backend and frontend). They live in their IDE.
  • #8 The side of an SRE
  • #9 Sides where different developer groups meet
  • #12 Establishing the order in which things happened. Different engineering teams might look at different tools for that, which is fine, but when trouble is around the corner it helps if everyone looks at the same picture.
  • #14 Best practices and documentation can help! Reading them helps even more… Heavily text-based? Not a good idea.
  • #16 Solution: set retries, either in the service mesh or in the backend code. Both are possible, but the important point is that the retry strategy is communicated among the different teams.
  • #17 Best practices and documentation can help! Reading them helps even more… Heavily text-based? Not a good idea.
  • #18 Different shades of architecture and services
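
The backend-code variant of the retry strategy in note #16 could look roughly like the sketch below. It is an illustration, not the approach shown in the talk: the function name `retry_with_backoff`, its parameters and the default values are assumptions, and exponential backoff with jitter is just one common choice of strategy. Whatever values a team picks, these are the numbers that need to be shared with whoever configures retries in the service mesh, so that the two layers do not multiply each other's attempts.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,       # assumed default, agree on this across teams
    base_delay_s: float = 0.1,   # assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is used up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            # Exponential backoff with full jitter, so that many clients
            # do not retry at the same moment.
            delay = random.uniform(0, base_delay_s * 2 ** (attempt - 1))
            sleep(delay)
    raise AssertionError("unreachable")
```

A call such as `retry_with_backoff(lambda: fetch_order(order_id))` would wrap a hypothetical `fetch_order` function. The `sleep` parameter exists only so that tests can replace the real waiting time.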