Introducing Distributed Tracing in
a Large Software System
Kostiantyn Sharovarskyi
Software engineer, specializing in .NET
I work on systems at Jooble, a job
aggregator that helps people find jobs
Why should I care about
Distributed Tracing?
How does [use-case] work?
I want to understand how a use-case works
I find the service codebase
I find an interesting slice of the code
I find that the use-case contains a db call
I find that the use-case contains a service call
Now I need to go to another service to see what it does
Repeat…
[Diagram: service, with a db call to db 1 and an HTTP call]
How does [use-case] work? Distributed Tracing alternative
I find the identifier of a certain request in PROD (trace ID)
I can see the whole picture with all the details
of this exact process in action in PROD
[Diagram: one trace spanning service, service 2, db 1 and db 2, with db calls, an HTTP call and message passing. Details shown on the spans:]
host: SERVER 1, endpoint: /endpoint, time: 1s, userID: 999
host: SERVER 4, time: 0.25s
host: SERVER 3, time: 0.25s, queue: queue_1
host: SERVER 2, endpoint: /endpoint2, time: 0.5s
What is Distributed Tracing?
Tracking application requests as they
flow between services, to messaging
systems, databases, etc.
One can call it: debugging for distributed systems
About company
Jooble is the №2 most popular job search aggregator, according to SimilarWeb
1 bil visits annually
140K resources
69 countries
2006 year of founding
25 languages
Jooble’s mission is to help people find work
How did Distributed Tracing help us? Case 1
How to find the root issue?
Traditionally, we can
Go to the code, read it and create
hypotheses about why this could happen
Add additional logs, add more logs continuously
until we understand what is going on
Or
Look at the trace to see where
the root of the problem is
Problem
We have a salary page that shows salary information
for various jobs and regions. For some jobs in some
countries, this page showed a 404 page:
a 404 response from the frontend-serving service.
Step: Look at the trace of the call
A timeout from an underlying service led the service to believe that there is no data, hence the 404
Step: Add more operations to the trace to pinpoint the culprit
Step: Zoom into the code and fix the problem. Profit!
A hot loop was doing a lookup with LINQ, O(n^2) overall, instead of a dictionary-based lookup, O(n) overall
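The Case 1 fix (a scan inside a hot loop replaced by a dictionary lookup) can be illustrated with a small sketch. The original code is .NET and is not shown in the slides, so the Python below only mirrors the shape of the problem; the function names and the (job_id, value) data layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical data: (job_id, title) pairs and (job_id, salary) pairs.
Job = Tuple[int, str]
Salary = Tuple[int, int]


def join_with_scan(jobs: List[Job], salaries: List[Salary]) -> Dict[str, int]:
    """Quadratic: every job triggers a linear scan over all salaries."""
    result: Dict[str, int] = {}
    for job_id, title in jobs:
        # Comparable to running a sequence query inside a hot loop.
        match = next((s for sid, s in salaries if sid == job_id), None)
        if match is not None:
            result[title] = match
    return result


def join_with_index(jobs: List[Job], salaries: List[Salary]) -> Dict[str, int]:
    """Linear: build a dictionary once, then do average O(1) lookups."""
    by_id: Dict[int, int] = {}
    for sid, s in salaries:
        by_id.setdefault(sid, s)  # keep the first match, like the scan does
    result: Dict[str, int] = {}
    for job_id, title in jobs:
        if job_id in by_id:
            result[title] = by_id[job_id]
    return result
```

Both functions return the same mapping; only the number of comparisons differs, which is what showed up as a long span in the trace.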
How did Distributed Tracing help us? Case 2
Context
Google uses various metrics to understand how
user-friendly a site is. Better metrics mean better
rankings in the search engine. One of the metrics is
focused on site performance: LCP (Largest Contentful Paint), or load speed.
Problem
How can we improve our load speed?
Tracing provides us a new tool:
Understand the full picture of what is going on in the
backend by inspecting the trace of a user request
We found
Multiple requests for the same data
Duplicate requests from frontend Server Side
Rendering (SSR)
Serial requests where they
can be made concurrently
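Two of the Case 2 findings (duplicate requests for the same data, and serial requests that could run concurrently) can be sketched as follows. This is an illustration only, not Jooble's code: `fetch` is a hypothetical backend call, and the sleep stands in for network latency.

```python
import asyncio
from typing import List

calls: List[str] = []  # records every simulated backend call


async def fetch(key: str) -> str:
    """Hypothetical backend call; the sleep stands in for network latency."""
    calls.append(key)
    await asyncio.sleep(0.05)
    return f"data:{key}"


async def load_serial(keys: List[str]) -> List[str]:
    # One request after another: total time is the sum of all latencies,
    # and duplicate keys are fetched again.
    return [await fetch(k) for k in keys]


async def load_concurrent_deduped(keys: List[str]) -> List[str]:
    # Request each distinct key once, and issue all requests at the same time.
    unique = list(dict.fromkeys(keys))  # preserves order, drops duplicates
    values = await asyncio.gather(*(fetch(k) for k in unique))
    by_key = dict(zip(unique, values))
    return [by_key[k] for k in keys]
```

In a trace, the serial version appears as spans laid end to end, while the concurrent version shows overlapping spans and fewer of them.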
OpenTelemetry as the backbone
of Distributed Tracing
OpenTelemetry is an observability framework: a collection
of tools, SDKs, documentation etc. for all things observability
I find that it provides
Ubiquitous language for the tracing concepts
Standardized ways to collect, send and sample traces
Interoperable implementation in various tech stacks
Distributed Tracing primitives: Span
Represents an operation
In .NET: implemented via the Activity class
Distributed Tracing primitives: Trace
Records the paths taken by
requests propagated through
multi-service architectures
In .NET: a collection of Activities with the
same TraceId. TraceId is usually
randomly generated, or passed
from the parent operation
Distributed Tracing primitives: Attribute
Attributes are key-value pairs
that contain metadata you
can attach to a Span
In .NET: Attributes are represented as
Activity Tags. You may attach
any tag at any point in the
lifetime of the Activity
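The relationship between the three primitives can be sketched in a few lines. This is a language-neutral illustration written in Python, not the .NET Activity API or the OpenTelemetry SDK; the `Span` class and `start_span` helper are hypothetical names.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    """One operation within a trace."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None
    attributes: Dict[str, object] = field(default_factory=dict)

    def set_attribute(self, key: str, value: object) -> None:
        # Attributes can be attached at any point in the span's lifetime.
        self.attributes[key] = value


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    if parent is None:
        # No parent: this is a root span, so generate a new random trace ID.
        return Span(name=name, trace_id=secrets.token_hex(16))
    # Child span: inherit the trace ID and point back at the parent.
    return Span(name=name, trace_id=parent.trace_id, parent_id=parent.span_id)
```

A trace is then simply every span that shares one `trace_id`, and the `parent_id` links give the tree shown in a trace viewer.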
Distributed Tracing primitives
[Diagram: service 1 makes an HTTP call to service 2. The trace contains a call span (service: service, attr1: value, attr2: value) and a receive span (service: service2, attr1: value, attr2: value)]
Distributed Tracing primitives
Very often, you don’t use those primitives
yourself: instrumentation already exists for many popular
libraries (SqlClient, Redis, Hangfire, PostgreSQL
etc.), and it is easy to add these libraries to your
tracing setup
How to propagate Trace information? HTTP calls
service 1 calls service 2 with the header
traceparent: 00-xx-xx-01
The header fields are: version, trace ID, parent ID, trace flags (isSampled)
Enabling propagation in .NET
Client side: OpenTelemetry.Instrumentation.Http
Server side: OpenTelemetry.Instrumentation.AspNetCore
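To make the header layout concrete, here is a minimal sketch of parsing and re-serialising a traceparent value. It follows the version-00 layout of the W3C Trace Context header (2, 32, 16 and 2 lowercase hex characters) and is a simplification: the instrumentation packages named above do this for you, and the `TraceParent` type and `parse_traceparent` function are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# version - trace ID - parent ID - trace flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: int

    @property
    def is_sampled(self) -> bool:
        # The lowest bit of the flags field is the "sampled" flag.
        return bool(self.flags & 0x01)

    def to_header(self) -> str:
        return f"{self.version}-{self.trace_id}-{self.parent_id}-{self.flags:02x}"


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    m = _TRACEPARENT.match(value.strip())
    if m is None:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, int(flags, 16))
```

A receiving service keeps the trace ID, uses the incoming parent ID as the parent of its own span, and forwards a new header carrying its own span ID.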
How to propagate Trace information? Messaging
Amazon SQS, Google Cloud Pub/Sub, RabbitMQ
It is important to make tracing easy to add, so it is
best to provide a way to add tracing for the most
used library.
For messaging, we at Jooble use RabbitMQ via the
EasyNetQ library and, at the time of writing, there
is no built-in support for OpenTelemetry tracing.
What can we do?
How to propagate Trace information? Messaging
Find extension points
Wrap the library methods with your own instrumentation
Extract into a library
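The wrapping step can be sketched without any real messaging library. Everything here is hypothetical: `InMemoryBus` stands in for a messaging client (it is not the EasyNetQ or RabbitMQ API), and `TracedBus` is the wrapper that carries trace context in message headers, assuming the client lets you attach headers to a message.

```python
import secrets
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict


@dataclass
class Message:
    body: str
    headers: Dict[str, str] = field(default_factory=dict)


class InMemoryBus:
    """Stand-in for a messaging client; not a real library API."""

    def __init__(self) -> None:
        self._queue: Deque[Message] = deque()

    def publish(self, message: Message) -> None:
        self._queue.append(message)

    def consume(self, handler: Callable[[Message], None]) -> None:
        while self._queue:
            handler(self._queue.popleft())


class TracedBus:
    """Wraps a bus so that trace context travels in message headers."""

    def __init__(self, inner: InMemoryBus) -> None:
        self._inner = inner

    def publish(self, body: str, trace_id: str) -> str:
        # Producer span: its ID becomes the parent ID seen by the consumer.
        span_id = secrets.token_hex(8)
        header = f"00-{trace_id}-{span_id}-01"
        self._inner.publish(Message(body, {"traceparent": header}))
        return span_id

    def consume(self, handler: Callable[[str, str, str], None]) -> None:
        def traced(message: Message) -> None:
            _, trace_id, parent_id, _ = message.headers["traceparent"].split("-")
            # The handler continues the same trace as a child of the producer.
            handler(message.body, trace_id, parent_id)

        self._inner.consume(traced)
```

Once a wrapper like this lives in a shared library, services get propagation across the queue by switching to the wrapped client.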
How to propagate Trace information?
In-process background workers
Imagine you offload some processing to background threads.
How to preserve the trace information?
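One possible answer is to capture the trace context when the work is enqueued and restore it on the worker thread. The slide's own .NET solution is not part of the extracted text, so the Python sketch below only illustrates the idea; `current_trace_id`, `enqueue` and `worker` are hypothetical names.

```python
import contextvars
import queue
import threading
from typing import Callable, Optional, Tuple

# Holds the trace ID of the operation currently running in this context.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

WorkItem = Tuple[contextvars.Context, Callable[[], None]]
work_queue: "queue.Queue[Optional[WorkItem]]" = queue.Queue()


def enqueue(job: Callable[[], None]) -> None:
    # Capture the caller's context (including the trace ID) at enqueue time.
    work_queue.put((contextvars.copy_context(), job))


def worker() -> None:
    while True:
        item = work_queue.get()
        if item is None:  # sentinel: stop the worker
            break
        ctx, job = item
        # Run the job inside the captured context, not the worker's own.
        ctx.run(job)
```

Without the captured context, the worker thread would see the default value and any spans it created would start a new, disconnected trace.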
Sampling as a source of confusion
Storing all traces may be infeasible due
to storage concerns.
Sampling is a strategy for choosing
which traces to store and which to drop.
The decision is carried in the traceparent header: version, trace ID, parent ID, trace flags (isSampled)
Choose one strategy and stick to it:
Head sampling (decide on sampling when the trace is started)
Tail sampling (decide on sampling after the trace is done)
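Head sampling can be sketched as a function evaluated when a span starts. This is a simplified illustration of the idea behind parent-based and trace-ID-ratio samplers, not the exact algorithm of any SDK; `should_sample` and its parameters are hypothetical names.

```python
from typing import Optional


def should_sample(
    trace_id: str, ratio: float, parent_sampled: Optional[bool] = None
) -> bool:
    """Head sampling decision, made when a span starts.

    If a parent already decided (the isSampled flag in traceparent), follow
    it so the whole trace is kept or dropped consistently. Otherwise decide
    deterministically from the trace ID, so the same ID always gets the
    same answer for a given ratio.
    """
    if parent_sampled is not None:
        return parent_sampled
    # Compare the low 64 bits of the trace ID with a threshold derived
    # from the ratio; integer maths avoids float rounding at ratio 1.0.
    threshold = int(ratio * 2**64)
    return int(trace_id[-16:], 16) < threshold
```

Tail sampling cannot be written this way, because it needs the finished trace (for example its duration or error status) before it can decide.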
Sampling as a source of confusion
Tip 1: Understand and communicate
your sampling strategy
● What service is the first to start the trace and
decide on whether to record the trace?
● What proportion of traces are sampled?
● How to change the number of traces to be
sampled?
● How to understand if the trace was sampled?
Tip 2: Give a way to force
a sampling decision
Tip 3: Use a much more lenient sampling
strategy on test environments (e.g.
record all traces)
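The tips on forcing a decision and on lenient test environments could look like the sketch below. All specifics are assumptions made for illustration: the `x-force-trace` header name, the environment names and the ratios are hypothetical and not taken from the talk.

```python
from typing import Mapping

# Hypothetical per-environment ratios; the test environment records everything.
SAMPLING_RATIO = {"prod": 0.01, "test": 1.0}
FORCE_HEADER = "x-force-trace"  # hypothetical header name, not a standard


def trace_flags(env: str, headers: Mapping[str, str], trace_id: str) -> str:
    """Return the traceparent flags field: "01" if sampled, else "00"."""
    if headers.get(FORCE_HEADER) == "1":
        return "01"  # explicit request to record this trace
    # Otherwise fall back to a ratio-based decision on the trace ID.
    threshold = int(SAMPLING_RATIO[env] * 2**64)
    sampled = int(trace_id[-16:], 16) < threshold
    return "01" if sampled else "00"
```

The returned flags value is also the answer to "was this trace sampled?": a trailing 01 in the traceparent header means it was.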
Choosing a Tracing backend
[Diagram: services -> traces collection -> traces storage + querying]
Choosing a tracing backend is an architectural decision.
Delaying architectural decisions is a useful skill.
Probability of change: tracing backend >>> collection mechanism
Decouple services from the tracing backend via the OpenTelemetry Collector: middleware
that can redirect traces to one or more tracing backends of choice.
Choosing a Tracing backend
If you couple your apps to the collector, you can
● Change tracing backend and/or visualization tool without changing app code
● (Optional) Try out several backends at the same time
● (Optional) Configure tail sampling or other processors that mutate traces before
they go to the backend
[Diagram: services -> traces collection -> traces storage + querying]
Choosing a Tracing backend
Look into the capabilities and limitations
of your organisation to decide on a
tracing backend
We at Jooble wanted
Utilise our own storage and compute capabilities
(why pay for things that we already have in our
datacenters)
Interoperation with other observability tools
that we already use
Performance that can handle our traffic
Choosing a Tracing backend
We chose Grafana Tempo
It can store traces on your own disks
It can be deployed to your hardware
The visualisation of traces is built into Grafana
Claimed performance characteristics satisfied our needs
Choosing a Tracing backend
It proved to be a good choice because it went even
further.
Grafana Tempo now has a query language,
TraceQL, that can query traces based on different
characteristics of the spans in them
E.g. it allows us to find
Duplicate requests to a service
DB calls that are over a certain threshold
Traces that trigger a certain bug we investigate
Problems with choosing a
Cutting Edge solution - a story
At some point after upgrading to Tempo 2.0
search requests started to look like this
● Search request 1: 0.5s
● Search request 2: Bad Gateway
● Search request 3: 10.5s
● Search request 4: 0.5s
Problems with choosing a
Cutting Edge solution - a story
What’s going on?
Error logs showed the following:
error clearing completing block: unlinkat
/var/tempo/wal/{folderName}:
directory not empty
After investigating the Tempo code (a very nice feature is
that Tempo is open source), it became clear that it was
performing a recursive delete on the folder.
Problems with choosing a
Cutting Edge solution - a story
The culprit: the NFS (Network File System) server-side silly rename mechanism.
If a file is open on the server, the delete operation does not delete it, but
just renames it, postponing the actual delete until the file is closed.
Problems with choosing a
Cutting Edge solution - a story
Failed folder delete operations disrupted Tempo’s storage optimisation mechanism that then
wreaked havoc on search performance.
I filed a Pull Request that closes all files opened by Tempo, and this fixed things. Big thanks
to the Tempo team for responding to my questions and helping get the fix over the finish line.
Try tracing yourself
.NET BCL provides all the
required primitives
Variety of tracing backends and visualisation tools:
Grafana Tempo, Jaeger, Zipkin, Honeycomb etc
Support for many popular libraries
and frameworks is there
Introduction can be incremental
(one service at a time)
How to contact me? kostiantyn@sharovarskyi.com
k_sharovarskyi
Check out my website where I sometimes
post blog posts sharovarskyi.com
See open roles
We are hiring!
Explore open positions on our website
Thank you! Questions?

 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

"Introducing Distributed Tracing in a Large Software System", Kostiantyn Sharovarskyi

  • 1. Introducing Distributed Tracing in a Large Software System
  • 2. Kostiantyn Sharovarskyi. Software engineer, specializing in .NET. I work on systems at Jooble, a job aggregator that helps people find jobs.
  • 3. Why should I care about Distributed Tracing?
  • 4. How does [use-case] work? I want to understand how a use-case works.
  • 5. How does [use-case] work? I find the service codebase. (Diagram: service)
  • 6. How does [use-case] work? I find an interesting slice of the code. (Diagram: service)
  • 7. How does [use-case] work? I find that the use-case contains a db call. (Diagram: service, db call, db 1)
  • 8. How does [use-case] work? I find that the use-case contains a service call. Now I need to go to another service to see what it does. Repeat… (Diagram: service, db call, db 1, HTTP call)
  • 9. How does [use-case] work? Distributed Tracing alternative: I find the identifier of a certain request in PROD (the trace ID).
  • 10. How does [use-case] work? Distributed Tracing alternative: I can see the whole picture. (Diagram: service, service 2, HTTP call, message passing, db calls to db 1 and db 2)
  • 11. How does [use-case] work? Distributed Tracing alternative: I can see the whole picture with all the details of this exact process in action in PROD. (Diagram: service, service 2, HTTP call, message passing, db calls to db 1 and db 2, annotated with: host: SERVER 1, endpoint: /endpoint, time: 1s, userID: 999; host: SERVER 4, time: 0.25s; host: SERVER 3, time: 0.25s, queue: queue_1; host: SERVER 2, endpoint: /endpoint2, time: 0.5s)
  • 12. What is Distributed Tracing? Tracking application requests as they flow between services, messaging systems, databases, etc. One can call it debugging for distributed systems.
  • 13. About the company. Jooble is the №2 most popular job search aggregator: 1 billion visits annually, 140K resources, 69 countries, 25 languages, founded in 2006 (according to SimilarWeb). Jooble’s mission is to help people find work.
  • 14. How did Distributed Tracing help us? Case 1
  • 15. How did Distributed Tracing help us? Case 1. Problem: We have a salary page that shows salary information for various jobs and regions. For some jobs in some countries, this page showed a 404 page. How to find the root issue? Traditionally, we can go to the code, read it and create hypotheses about why this could happen, then add logs, and keep adding logs until we understand what is going on. Or we can look at the trace to see where the root of the problem is.
  • 16. How did Distributed Tracing help us? Case 1. Problem: 404 response from the frontend-serving service. Step: Look at the trace of the call. A timeout from an underlying service led the service to believe that there is no data, hence the 404.
  • 17. How did Distributed Tracing help us? Case 1. Problem: 404 response from the frontend-serving service. Step: Add more operations to the trace to pinpoint the culprit.
  • 18. How did Distributed Tracing help us? Case 1. Problem: 404 response from the frontend-serving service. Step: Zoom into the code and fix the problem. Profit! A hot loop was doing a lookup with LINQ, which is O(n^2), instead of a dictionary-based lookup, which is O(n).
  • 19. How did Distributed Tracing help us? Case 2. Context: Google uses various metrics to understand how user-friendly a site is. Better metrics mean better rankings in the search engine. One of the metrics is focused on site performance: LCP (Largest Contentful Paint), or load speed. Problem: How can we improve our load speed? Tracing provides us with a new tool: understand the full picture of what is going on in the backend by inspecting the trace of a user request.
  • 20. How did Distributed Tracing help us? Case 2. Problem: How can we improve our load speed? We found: multiple requests for the same data; duplicate requests from frontend Server Side Rendering (SSR); serial requests where they can be made concurrently.
  • 21. OpenTelemetry as the backbone of Distributed Tracing. OpenTelemetry is an observability framework: a collection of tools, SDKs, documentation, etc. for all things observability. I find that it provides: a ubiquitous language for the tracing concepts; standardized ways to collect, send and sample traces; interoperable implementations in various tech stacks.
  • 22. Distributed Tracing primitives: Span. Represents an operation. Implemented via the Activity class.
  • 23. Distributed Tracing primitives: Trace. Records the paths taken by requests propagated through multi-service architectures. A collection of Activities with the same TraceId. The TraceId is usually randomly generated or passed from the parent operation.
  • 24. Distributed Tracing primitives: Attribute. Attributes are key-value pairs that contain metadata you can attach to a Span. Attributes are represented as Activity Tags. You may attach any tag at any point in the lifetime of the Activity.
  • 25. Distributed Tracing primitives. (Diagram: service 1, HTTP call, service 2)
  • 26. Distributed Tracing primitives. (Diagram: a trace containing a call span and a receive span; service: service, attr1: value, attr2: value; service: service2, attr1: value, attr2: value)
  • 27. Distributed Tracing primitives. Very often, you don’t use those primitives yourself: this is already done for many popular libraries (SqlClient, Redis, Hangfire, PostgreSQL, etc.). It is very easy to add these libraries to your tracing setup.
  • 28. How to propagate Trace information? (Diagram: version, trace ID, parent ID, trace flags (isSampled))
  • 29. How to propagate Trace information? HTTP calls. (Diagram: service 1 calls service 2 with HEADERS: traceparent: 00-xx-xx-01)
  • 30. How to propagate Trace information? HTTP calls. Enabling propagation in .NET: on the client side, OpenTelemetry.Instrumentation.Http; on the server side, OpenTelemetry.Instrumentation.AspNetCore.
  • 31. How to propagate Trace information? Messaging (Amazon SQS, Google Cloud Pub/Sub, RabbitMQ). It is important to make tracing easy to add, so it is best to provide a way to add tracing for the most used library. For messaging, we at Jooble use RabbitMQ via the EasyNetQ library, and at the time of writing there is no built-in support for OpenTelemetry tracing. What can we do?
  • 32. How to propagate Trace information? Messaging: find extension points.
  • 33. How to propagate Trace information? Messaging: wrap the library methods with your own instrumentation.
  • 34. How to propagate Trace information? Messaging: extract into a library.
  • 35. How to propagate Trace information? In-process background workers. Imagine you offload some processing to background threads. How do you preserve the trace information?
  • 36. Sampling as a source of confusion. Storing all traces may be infeasible due to storage concerns. Sampling is a strategy for choosing which traces to store and which to drop. (Diagram: version, trace ID, parent ID, trace flags (isSampled)) Choose one strategy and stick to it: head sampling (decide on sampling when the trace is started) or tail sampling (decide on sampling after the trace is done).
  • 37. Sampling as a source of confusion. Tip 1: Understand and communicate your sampling strategy. Tip 2: Give a way to force a sampling decision. Tip 3: Use a much more lenient sampling strategy on test environments (e.g. record all traces). Questions to answer: What service is the first to start the trace and decide on whether to record the trace? What proportion of traces are sampled? How to change the number of traces to be sampled? How to understand if the trace was sampled?
  • 38. Choosing a Tracing backend. (Diagram: services, traces collection, traces storage + querying) Choosing a tracing backend is an architectural decision. Delaying architectural decisions is a useful skill. The tracing backend is much more likely to change than the collection mechanism. Decouple services from the tracing backend via the OpenTelemetry Collector, middleware that can redirect traces to one or more tracing backends of choice.
  • 39. Choosing a Tracing backend. (Diagram: services, traces collection, traces storage + querying) If you couple your apps to the collector, you can: change the tracing backend and/or visualization tool without changing app code; (optional) try out several backends at the same time; (optional) configure tail sampling or other processors that mutate traces before they go to the backend.
  • 40. Choosing a Tracing backend. Look into the capabilities and limitations of your organisation to decide on a tracing backend. We at Jooble wanted: to utilise our own storage and compute capabilities (why pay for things that we already have in our datacenters); interoperation with other observability tools that we already use; performance that can handle our traffic.
  • 41. Choosing a Tracing backend. We chose Grafana Tempo: it can store traces on your own disks; it can be deployed to your hardware; the visualisation of traces is built into Grafana; the claimed performance characteristics satisfied our needs.
  • 42. Choosing a Tracing backend. It proved to be a good choice because it went even further. Grafana Tempo now has a query language, TraceQL, that can query traces based on different characteristics of the spans in them. E.g. it allows us to find: duplicate requests to a service; DB calls that are over a certain threshold; traces that trigger a certain bug we investigate.
  • 43. Problems with choosing a Cutting Edge solution - a story. At some point after upgrading to Tempo 2.0, search requests started to look like this: Search request 1: 0.5s; Search request 2: Bad Gateway; Search request 3: 10.5s; Search request 4: 0.5s.
  • 44. Problems with choosing a Cutting Edge solution - a story. Error logs showed the following: error clearing completing block: unlinkat /var/tempo/wal/{folderName}: directory not empty. After investigating the Tempo code (a very nice feature is that Tempo is open source), it is clear that it is performing a recursive delete on the folder. What’s going on?
  • 45. Problems with choosing a Cutting Edge solution - a story. The culprit: the NFS (Network File System) server-side silly rename mechanism. If the file is open on the server, the delete operation does not delete the file but just renames it, postponing the deletion until the file is closed.
  • 46. Problems with choosing a Cutting Edge solution - a story. Failed folder delete operations disrupted Tempo’s storage optimisation mechanism, which then wreaked havoc on search performance. I filed a Pull Request that closes all files opened by Tempo, and this fixed things. Big thanks to the Tempo team for responding to my questions and helping to get the fix over the finish line.
  • 47. Try tracing yourself. The .NET BCL provides all the required primitives. There is a variety of tracing backends and visualisation tools: Grafana Tempo, Jaeger, Zipkin, Honeycomb, etc. Support for many popular libraries and frameworks is there. Introduction can be incremental (one service at a time).
  • 48. How to contact me? kostiantyn@sharovarskyi.com, k_sharovarskyi. Check out my website, where I sometimes publish blog posts: sharovarskyi.com
  • 49. We are hiring! Explore open positions on our website.
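
The traceparent header from slides 28, 29 and 36 (version, trace ID, parent ID, trace flags) follows the W3C Trace Context format. As a rough illustration of what the instrumentation libraries do on your behalf, here is a minimal sketch in Python rather than the .NET used in the talk. The names TraceParent, parse_traceparent and child_of are hypothetical helpers invented for this example, and the parser only handles the four-field version 00 layout.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# version (2 hex) - trace ID (32 hex) - parent ID (16 hex) - trace flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


@dataclass(frozen=True)
class TraceParent:
    """Hypothetical value object for one traceparent header."""
    version: str
    trace_id: str
    parent_id: str
    flags: int

    @property
    def is_sampled(self) -> bool:
        # The lowest bit of the flags byte is the "sampled" flag.
        return bool(self.flags & 0x01)

    def to_header(self) -> str:
        return f"{self.version}-{self.trace_id}-{self.parent_id}-{self.flags:02x}"


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid in W3C Trace Context.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, int(flags, 16))


def child_of(parent: TraceParent) -> TraceParent:
    """Header for an outgoing call: same trace ID and flags, new span ID."""
    return TraceParent(
        parent.version, parent.trace_id, secrets.token_hex(8), parent.flags
    )


if __name__ == "__main__":
    incoming = parse_traceparent(
        "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    )
    if incoming is not None:
        print("sampled:", incoming.is_sampled)
        print("outgoing:", child_of(incoming).to_header())
```

Because the sampled flag travels with the header, a head sampling decision made by the first service is visible to every downstream service, which is one way to answer "how to understand if the trace was sampled?".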