"Introducing Distributed Tracing in a Large Software System", Kostiantyn Sharovarskyi

Kostiantyn Sharovarskyi
Software engineer, specializing in .NET
I work on systems at Jooble, a job
aggregator that helps people find jobs

Why should I care about
Distributed Tracing?

How does [use-case] work?
I want to understand how a use-case works

I find the service codebase

I find an interesting slice of the code

I find that the use-case contains a db call
[Diagram: service makes a db call to db 1]

I find that the use-case contains a service call
Now I need to go to another service to see what it does
Repeat…
[Diagram: service makes a db call to db 1 and an HTTP call to another service]

Distributed Tracing Alternative
I find the identifier of a certain request in PROD (trace ID)

I can see the whole picture
[Diagram: service, service 2, db 1 and db 2, connected by db calls, an HTTP call and message passing]

I can see the whole picture with all the details
of this exact process in action in PROD
[Diagram: service, service 2, db 1 and db 2, connected by db calls, an HTTP call and message passing, with each operation annotated]
Annotations shown in the diagram:
● host: SERVER 1, endpoint: /endpoint, time: 1s, userID: 999
● host: SERVER 4, time: 0.25s
● host: SERVER 3, time: 0.25s, queue: queue_1
● host: SERVER 2, endpoint: /endpoint2, time: 0.5s

What is Distributed Tracing?
Tracking application requests as they flow between services, to messaging
systems, databases, etc.
One can call it: debugging for distributed systems

About company
Jooble is the №2 most popular job search aggregator, according to SimilarWeb
● 1 billion visits annually
● 140K resources
● 69 countries
● 25 languages
● Founded in 2006
Jooble’s mission is to help people find work

How did Distributed Tracing help us? Case 1

Case 1: How did Distributed Tracing help us?
Problem
We have a salary page that shows salary information for various jobs and regions.
For some jobs in some countries, this page showed a 404 page.
How to find the root issue?
Traditionally, we can
● Go to the code, read it and create hypotheses about why this could happen
● Add additional logs, and keep adding more until we understand what is going on
Or
● Look at the trace to see where the root of the problem is

Case 1
Problem
404 response from the frontend-serving service
Step
Look at the trace of the call
A timeout from an underlying service led the service to believe that there is no data, hence the 404

Case 1
Step
Add more operations to the trace to pinpoint the culprit

Case 1
Step
Zoom into the code and fix the problem. Profit!
A hot loop was doing a LINQ lookup (O(n^2) overall) instead of a dictionary-based lookup (O(n) overall)
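
The difference between the two lookups can be sketched as follows. This is a minimal illustration in Python rather than the original .NET/LINQ code, which is not shown in the slides; the function names and the `jobs`/`salaries` data shapes are hypothetical.

```python
def join_with_scan(jobs, salaries):
    """For every job, scan the whole salary list: O(n * m) comparisons."""
    result = []
    for job in jobs:
        # Linear search inside the loop, analogous to a LINQ FirstOrDefault.
        match = next((s for s in salaries if s["job_id"] == job["id"]), None)
        result.append((job["title"], match["amount"] if match else None))
    return result


def join_with_dict(jobs, salaries):
    """Build a dictionary once, then do O(1) lookups: O(n + m) overall."""
    salary_by_job = {}
    for s in salaries:
        # Keep the first entry per job so the result matches the scan version.
        salary_by_job.setdefault(s["job_id"], s)
    result = []
    for job in jobs:
        match = salary_by_job.get(job["id"])
        result.append((job["title"], match["amount"] if match else None))
    return result


# Hypothetical sample data, for illustration only.
jobs = [{"id": 1, "title": "driver"}, {"id": 2, "title": "nurse"}, {"id": 3, "title": "cook"}]
salaries = [{"job_id": 2, "amount": 3000}, {"job_id": 1, "amount": 2000}]
```

Both functions return the same pairs; only the amount of work inside the loop differs, which is the kind of hotspot a more detailed trace can make visible.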

Case 2
Context
Google uses various metrics to understand how user-friendly a site is. Better
metrics mean better rankings in the search engine. One of the metrics focuses
on site performance: LCP (Largest Contentful Paint), a measure of load speed.
Problem
How can we improve our load speed?
Tracing provides us a new tool:
Understand the full picture of what is going on in the backend by inspecting
the trace of a user request

Case 2
Problem
How can we improve our load speed?
We found
● Multiple requests for the same data
● Duplicate requests from frontend Server Side Rendering (SSR)
● Serial requests where they can be made concurrently
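
The last finding (serial requests that could run concurrently) can be illustrated with a short sketch. It is written in Python with `asyncio` for brevity; the `fetch` coroutine is a hypothetical stand-in for a backend call and only simulates latency with a sleep.

```python
import asyncio
import time


async def fetch(name, delay=0.05):
    """Hypothetical backend call; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"{name}-data"


async def load_serially(names):
    # Each request waits for the previous one: total time is the sum of delays.
    return [await fetch(name) for name in names]


async def load_concurrently(names):
    # Independent requests run together: total time is roughly the slowest one.
    return list(await asyncio.gather(*(fetch(name) for name in names)))


def timed(loader, names):
    """Run a loader to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(loader(names))
    return result, time.perf_counter() - start
```

In a trace, the serial version shows up as spans laid end to end, while the concurrent version shows overlapping spans under the same parent.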

OpenTelemetry as the backbone
of Distributed Tracing
OpenTelemetry is an observability framework: a collection
of tools, SDKs, documentation, etc. for all things observability
I find that it provides
● Ubiquitous language for the tracing concepts
● Standardized ways to collect, send and sample traces
● Interoperable implementation in various tech stacks

Distributed Tracing primitives: Span
Represents an operation.
In .NET, implemented via the Activity class.

Distributed Tracing primitives: Trace
Records the paths taken by requests propagated through multi-service architectures.
In .NET, a collection of Activities with the same TraceId. The TraceId is usually
randomly generated, or passed from the parent operation.

Distributed Tracing primitives: Attribute
Attributes are key-value pairs that contain metadata you can attach to a Span.
In .NET, attributes are represented as Activity Tags. You may attach any tag at
any point in the lifetime of the Activity.
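
How the three primitives relate can be sketched with a toy model. This is not the OpenTelemetry SDK and not the .NET Activity API; the `Span` class and helper names are hypothetical, and the ID lengths follow the W3C Trace Context sizes (16-byte trace ID, 8-byte span ID).

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Toy span: one operation, linked to its trace and its parent."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)

    def set_attribute(self, key, value):
        # Attributes are plain key-value metadata attached to the span.
        self.attributes[key] = value

    def start_child(self, name):
        # A child operation inherits the trace ID and points at its parent.
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)


def start_root_span(name):
    """A root span has no parent, so it gets a freshly generated trace ID."""
    return Span(name=name, trace_id=secrets.token_hex(16))


def spans_in_trace(spans, trace_id):
    """A trace is simply every span that shares the same trace ID."""
    return [span for span in spans if span.trace_id == trace_id]
```

For example, a root span for an incoming request and a child span for its db call share one trace ID, and each can carry its own attributes such as a host name or a user ID.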

Distributed Tracing primitives
[Diagram: service 1 makes an HTTP call to service 2. One trace contains a call span (service: service, attr1: value, attr2: value) and a receive span (service: service2, attr1: value, attr2: value)]

Very often, you don’t use those primitives
yourself: instrumentation already exists for many popular
libraries, such as SqlClient, Redis, Hangfire and PostgreSQL.
It is very easy to add these libraries to your
tracing setup

How to propagate Trace information?
The traceparent header has four fields: version, trace ID, parent ID, trace flags (isSampled)

How to propagate Trace information? HTTP calls
[Diagram: service 1 calls service 2 over HTTP and passes the trace information in a header]
HEADERS:
traceparent: 00-xx-xx-01
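
A minimal sketch of reading and writing this header, assuming the W3C Trace Context layout of four dash-separated lowercase hex fields (2, 32, 16 and 2 characters). The function and type names are hypothetical; in practice the instrumentation libraries do this for you.

```python
import re
from typing import NamedTuple, Optional

# version - trace ID - parent ID - trace flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a four-field traceparent value; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def format_traceparent(trace_id: str, parent_id: str, sampled: bool) -> str:
    """Build the header value an outgoing call would carry."""
    return f"00-{trace_id}-{parent_id}-{'01' if sampled else '00'}"
```

The receiving service reuses the trace ID for its own spans and records the incoming parent ID as the parent of its first span, which is how spans from different services end up in the same trace.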

How to propagate Trace information? HTTP calls
Enabling propagation in .NET
● Client side: OpenTelemetry.Instrumentation.Http
● Server side: OpenTelemetry.Instrumentation.AspNetCore

How to propagate Trace information? Messaging
It is important to make tracing easy to add, so it is
best to provide a way to add tracing for the most
used libraries.
For messaging, we at Jooble use RabbitMQ via the
EasyNetQ library, and at the time of writing there
is no built-in support for OpenTelemetry tracing.
What can we do?
Amazon SQS
Google Cloud Pub/Sub
RabbitMQ

Messaging: Find extension points

Messaging: Wrap the library methods with your own instrumentation

Messaging: Extract into a library
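
The wrapping step can be sketched as follows. This is a language-agnostic illustration in Python, not EasyNetQ or RabbitMQ code: `InMemoryBus` is a hypothetical stand-in for the messaging library, and `TracedBus` is the hypothetical wrapper that writes the trace context into message headers on publish and reads it back on consume.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    body: str
    headers: dict = field(default_factory=dict)


class InMemoryBus:
    """Hypothetical stand-in for the messaging library being wrapped."""

    def __init__(self):
        self.queue = []

    def publish(self, message):
        self.queue.append(message)

    def consume(self, handler):
        while self.queue:
            handler(self.queue.pop(0))


class TracedBus:
    """Wrapper that carries the trace context in a message header."""

    def __init__(self, inner):
        self.inner = inner

    def publish(self, message, trace_id, span_id, sampled=True):
        # Inject: the publishing span becomes the parent of the consumer span.
        flags = "01" if sampled else "00"
        message.headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
        self.inner.publish(message)

    def consume(self, handler):
        def traced_handler(message):
            # Extract: continue the publisher's trace on the consumer side.
            _, trace_id, parent_id, _ = message.headers["traceparent"].split("-")
            handler(message, trace_id, parent_id)

        self.inner.consume(traced_handler)
```

Once the wrapper lives in a shared library, services get trace propagation over messaging by switching to the wrapper instead of calling the messaging library directly.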

In-process background workers
Imagine you offload some processing to background threads.
How to preserve the trace information?
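
One general approach is to capture the ambient trace context when the work is enqueued and restore it when the worker runs the work. The sketch below shows the idea in Python with `contextvars`; it is an illustration of the concept only, not the .NET solution from the talk, and all names are hypothetical.

```python
import contextvars
import queue
import threading

# Ambient "current trace" value, similar in spirit to a current-activity slot.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def enqueue(work_queue, func):
    # Capture the caller's context at enqueue time, together with the work item.
    work_queue.put((contextvars.copy_context(), func))


def worker(work_queue, results):
    while True:
        item = work_queue.get()
        if item is None:  # sentinel: stop the worker
            break
        captured_context, func = item
        # Run the work inside the captured context so it sees the trace ID.
        results.append(captured_context.run(func))


def run_demo():
    work_queue = queue.Queue()
    results = []
    thread = threading.Thread(target=worker, args=(work_queue, results))
    thread.start()
    current_trace_id.set("trace-abc")  # set by the request that offloads work
    enqueue(work_queue, lambda: current_trace_id.get())
    work_queue.put(None)
    thread.join()
    return results
```

The key design choice is that the context travels with the work item, so the background thread does not need to know which request produced it.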

Sampling as a source of confusion
Storing all traces may be infeasible due to storage concerns.
Sampling is a strategy for choosing which traces to store and which to drop.
traceparent fields: version, trace ID, parent ID, trace flags (isSampled)
Choose one strategy and stick to it:
● Head sampling (decide on sampling when the trace is started)
● Tail sampling (decide on sampling after the trace is done)
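
The two strategies can be contrasted with a small sketch. It is a simplified illustration, not the exact algorithm of any OpenTelemetry sampler: the head sampler derives a deterministic decision from the trace ID, and the tail sampler looks at the finished spans. The span dictionaries and the threshold are hypothetical.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide when the trace starts, using only the trace ID.

    The low 64 bits of the (random) trace ID are treated as a uniform number,
    so every service computing this gets the same answer for the same trace.
    """
    return int(trace_id[-16:], 16) < ratio * 2**64


def tail_sample(spans, slow_threshold_s: float = 1.0) -> bool:
    """Decide after the trace is done: keep traces with an error or a slow span."""
    return any(
        bool(span.get("error")) or span["duration_s"] >= slow_threshold_s
        for span in spans
    )
```

Head sampling is cheap but blind to the outcome; tail sampling can keep exactly the interesting traces but needs all spans of a trace collected in one place before it decides.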

Sampling as a source of confusion
tip 1: Understand and communicate your sampling strategy
● What service is the first to start the trace and decide on whether to record the trace?
● What proportion of traces are sampled?
● How to change the number of traces to be sampled?
● How to understand if the trace was sampled?
tip 2: Give a way to force a sampling decision
tip 3: Use a much more lenient sampling strategy on test environments (e.g. record all traces)
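
Tips 2 and 3 can be combined in one decision function, sketched below. The header name `x-force-trace`, the environment names and the ratios are all hypothetical choices made for illustration; they are not taken from the talk or from any library.

```python
# Hypothetical per-environment ratios: record everything on test, 1% in prod.
RATIO_BY_ENVIRONMENT = {"prod": 0.01, "test": 1.0}


def should_sample(trace_id: str, environment: str, headers: dict) -> bool:
    """Head-sampling decision with a manual override."""
    # Tip 2: a hypothetical request header forces the trace to be recorded.
    if headers.get("x-force-trace") == "1":
        return True
    # Tip 3: the ratio depends on the environment; unknown environments sample nothing.
    ratio = RATIO_BY_ENVIRONMENT.get(environment, 0.0)
    # Deterministic decision from the low 64 bits of the trace ID.
    return int(trace_id[-16:], 16) < ratio * 2**64
```

With an override like this, someone investigating a specific request in production can make sure its trace is stored even when the regular ratio would drop it.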

Choosing a Tracing backend
[Diagram: services → traces collection → traces storage + querying]
Choosing a tracing backend is an architectural decision.
Delaying architectural decisions is a useful skill.
The tracing backend is far more likely to change than the collection mechanism.
Decouple services from the tracing backend via the OpenTelemetry Collector: middleware
that can redirect traces to one or more tracing backends of choice.

If you couple your apps to the collector, you can
● Change the tracing backend and/or visualization tool without changing app code
● (Optional) Try out several backends at the same time
● (Optional) Configure tail sampling or other processors that mutate traces before
they go to the backend
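
The decoupling idea can be shown with a toy model. This is not the OpenTelemetry Collector or its configuration; `Collector`, `InMemoryBackend` and `handle_request` are hypothetical names used to show that the application only knows the collector, while backends and processors are wired up outside of app code.

```python
class InMemoryBackend:
    """Hypothetical tracing backend that just stores what it receives."""

    def __init__(self, name):
        self.name = name
        self.spans = []

    def export(self, span):
        self.spans.append(span)


class Collector:
    """Receives spans, runs optional processors, fans out to every backend."""

    def __init__(self, backends, processors=()):
        self.backends = list(backends)
        self.processors = list(processors)

    def receive(self, span):
        for process in self.processors:
            span = process(span)
            if span is None:  # a processor may drop the span
                return
        for backend in self.backends:
            backend.export(span)


def handle_request(collector, user_id):
    # Application code depends only on the collector, never on a backend.
    collector.receive({"name": "GET /endpoint", "user_id": user_id})
```

Swapping one backend for another, or running two side by side, changes only how the collector is constructed; `handle_request` stays untouched.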

Look into the capabilities and limitations of your organisation to decide on a
tracing backend
We at Jooble wanted
● To utilise our own storage and compute capabilities (why pay for things that we
already have in our datacenters?)
● Interoperation with other observability tools that we already use
● Performance that can handle our traffic

We chose Grafana Tempo
It can store traces on your own disks
It can be deployed to your hardware
The visualisation of traces is built into Grafana
Claimed performance characteristics satisfied our needs

It proved to be a good choice because it went even further.
Grafana Tempo now has a query language, TraceQL, that can query traces based on
different characteristics of the spans in them.
E.g. it allows us to find
● Duplicate requests to a service
● DB calls that are over a certain threshold
● Traces that trigger a certain bug we investigate

Problems with choosing a cutting-edge solution: a story
At some point after upgrading to Tempo 2.0,
search requests started to look like this
● Search request 1: 0.5s
● Search request 2: Bad Gateway

What’s going on?
Error logs showed the following:
error clearing completing block: unlinkat /var/tempo/wal/{folderName}: directory not empty
After investigating the Tempo code (a very nice feature is that Tempo is open source),
it became clear that it was performing a recursive delete on the folder.

The culprit: the NFS (Network File System) silly rename mechanism.
If a file is still open, the delete operation does not delete it but renames it
instead, postponing the actual delete until the file is closed. The folder therefore
still contains files, and removing it fails with "directory not empty".

Failed folder delete operations disrupted Tempo’s storage optimisation mechanism, which then
wreaked havoc on search performance.
I filed a pull request that closes all files opened by Tempo, and this fixed things. Big thanks
to the Tempo team for responding to my questions and helping get the fix over the finish line.

Try tracing yourself
● The .NET BCL provides all the required primitives
● Variety of tracing backends and visualisation tools: Grafana Tempo, Jaeger, Zipkin, Honeycomb, etc.
● Support for many popular libraries and frameworks is there
● Introduction can be incremental (one service at a time)

How to contact me?
kostiantyn@sharovarskyi.com
k_sharovarskyi
Check out my website, where I sometimes post blog posts: sharovarskyi.com

See open roles
We are hiring!
Explore open positions on our website
