A presentation that outlines aspects of cross-team collaboration among engineering teams when handling incidents. It looks at tools and concepts along the phases of resolving, preventing, and discovering incidents.
21. OpenTelemetry
- CNCF project
- Vendor neutral
- Merger of OpenTracing and OpenCensus
- Specifications, protocols, APIs and SDKs
- No telemetry data backend!
22. But what to instrument?
Questions raised by application developers:
- What is the meaning of SLOs?
- How do I write queries if I want to understand the metrics that I am collecting?
- Auto-instrumentation?
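The SLO question becomes more concrete with a little error budget arithmetic. The following is a minimal sketch, not taken from the presentation: the 99.9% target, the 30-day window, and both function names are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 0.0
    return 1.0 - failed_requests / allowed_failures


# Hypothetical numbers: a 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
downtime_budget = error_budget_minutes(0.999, window_days=30)

# Hypothetical numbers: 250 failures out of 1,000,000 requests spend about a quarter of the budget.
remaining = budget_remaining(0.999, total_requests=1_000_000, failed_requests=250)
```

Framing an SLO as a budget gives application developers and SREs a shared number to talk about: how much unreliability is still affordable before feature work should give way to stability work.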
Handling incidents collaboratively is like solving a Rubik's Cube
Understanding the business outcome and the overall functionality of a system that consists of distributed services, plus the infrastructure components needed to run them at scale, is almost like solving a Rubik's Cube. Once an incident occurs, it is not enough to look at a single side of the cube. To solve the puzzle, all sides of the cube need to be considered.
Monitoring a distributed system should not be the effort of a single engineering team. Observability should be a goal for all engineering teams.
Nevertheless, it is often a mantra just for SRE teams.
Coming from the perspective of an application engineer, I will outline how application engineers benefit from understanding infrastructure and common incidents, and how SRE teams benefit from understanding common failures in the application code. Let's take a deeper look at what collaboration across different engineering teams means and how it supports the process of solving the Rubik's Cube together.
The side of the application developers (backend and frontend)
They live in their IDE
The side of an SRE
Sides where different developer groups meet
Getting the order of events that have happened: different engineering teams might look at different tools for that, and that is fine. But when trouble is around the corner, it helps if everyone looks at the same picture.
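One way to get everyone looking at the same picture is to merge the events from each team's tool into a single time-ordered list. The sketch below is only an illustration under stated assumptions: the `Event` record, the three sources, and all sample messages are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import chain


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str   # which tool or team reported the event
    message: str


def merge_timelines(*streams):
    """Combine events from several tools into one list ordered by timestamp."""
    return sorted(chain.from_iterable(streams), key=lambda event: event.timestamp)


def at(hour, minute):
    """Helper for the hypothetical sample data: a UTC time on one fixed day."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Hypothetical sample events from three different tools.
deploys = [Event(at(10, 0), "deploy", "checkout service rolled out")]
alerts = [Event(at(10, 5), "alerting", "5xx rate above threshold")]
app_logs = [Event(at(10, 3), "app-log", "timeout calling payment service")]

timeline = merge_timelines(deploys, alerts, app_logs)
for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.message)
```

Each team can keep its preferred tool. The merged view only requires that every event carries a comparable timestamp, ideally in one time zone.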
Best practices and documentation can help! Reading it helps even more…
Heavily text-based? Not a good idea.
Solution: set retries either in the service mesh or in the backend code. Both are possible, but what matters is that the retry strategy is communicated among the different teams.
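To illustrate why the retry strategy needs to be communicated, here is a minimal sketch of a retry in backend code, using exponential backoff with jitter. The function names, default values, and the `worst_case_calls` helper are assumptions made for illustration, not the presenter's implementation.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call operation(); on an exception, retry with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error to the caller
            # The delay ceiling doubles with each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


def worst_case_calls(mesh_attempts, app_attempts):
    """Upper bound on upstream calls when the mesh and the application both retry."""
    return mesh_attempts * app_attempts
```

If the service mesh is also configured for three attempts, one user request can turn into up to `worst_case_calls(3, 3)`, that is nine, calls against a service that is already struggling. That multiplication is why both teams need to know where retries are configured.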