Software systems grow in size and complexity as the business grows, and sometimes it is hard to figure out what is going on. Various teams make different changes for different business capabilities. Distributed Tracing is a useful way to look under the hood and see for yourself what operations are being performed, what services are used in a certain use case, and how performant they are. In this talk, I will present what Distributed Tracing is and how we introduced it into our software system, with some tips and tricks on what to focus on if you want to do the same.
7. How does [use-case] work?
I find that the use-case contains a db call
[Diagram: service → db call → db 1]
8. How does [use-case] work?
I find that the use-case contains a service call
Now I need to go to another service to see what it does
Repeat…
[Diagram: service → HTTP call → service → db call → db 1]
10. How does [use-case] work?
Distributed Tracing Alternative
I can see the whole picture
[Diagram: service, service 2, db 1 and db 2, connected by db calls, an HTTP call and message passing]
11. How does [use-case] work?
Distributed Tracing Alternative
I can see the whole picture with all the details of this exact process in action in PROD
[Diagram: service, service 2, db 1 and db 2, connected by db calls, an HTTP call and message passing; the operations are annotated with:]
● host: SERVER 1, endpoint: /endpoint, time: 1s, userID: 999
● host: SERVER 4, time: 0.25s
● host: SERVER 3, time: 0.25s, queue: queue_1
● host: SERVER 2, endpoint: /endpoint2, time: 0.5s
12. What is Distributed Tracing?
Tracking application requests as they flow between services, messaging systems, databases, etc.
One can call it: debugging for distributed systems
13. About company
Jooble is the №2 most popular job search aggregator, according to SimilarWeb
● 1 billion visits annually
● 140K resources
● 69 countries
● 25 languages
● Founded in 2006
Jooble’s mission is to help people find work
15. How did Distributed Tracing help us?
Case 1
Problem
We have a salary page that shows salary information for various jobs and regions. For some jobs in some countries, this page returned a 404 instead.
How to find the root issue?
Traditionally, we can
● Go to the code, read it and create hypotheses about why this could happen
● Add additional logs, and keep adding more until we understand what is going on
Or
● Look at the trace to see where the root of the problem is
16. How did Distributed Tracing help us?
Case 1
Problem
404 response from the frontend-serving service
Step
Look at the trace of the call
A timeout from an underlying service led the service to believe that there is no data, hence the 404
17. How did Distributed Tracing help us?
Case 1
Problem
404 response from the frontend-serving service
Step
Add more operations to the trace to pinpoint the culprit
18. How did Distributed Tracing help us?
Case 1
Problem
404 response from the frontend-serving service
Step
Zoom into the code and fix the problem. Profit!
A hot loop was doing a lookup with LINQ in O(n^2) instead of a dictionary-based lookup in O(n)
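To make the complexity difference concrete, here is a minimal Python sketch of the same pattern. It is not our production code (that was C# with LINQ); `Salary`, `join_with_scan` and `join_with_dict` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Salary:
    job_id: int
    amount: int


def join_with_scan(job_ids, salaries):
    """O(n^2): every job triggers a linear scan over all salaries."""
    result = []
    for job_id in job_ids:
        match = next((s for s in salaries if s.job_id == job_id), None)
        if match is not None:
            result.append((job_id, match.amount))
    return result


def join_with_dict(job_ids, salaries):
    """O(n): build the index once, then each lookup is a hash probe."""
    by_job_id = {}
    for s in salaries:
        by_job_id.setdefault(s.job_id, s)  # keep the first match, like the scan
    return [(j, by_job_id[j].amount) for j in job_ids if j in by_job_id]


# Both versions return the same result; only the amount of work differs.
salaries = [Salary(job_id=i, amount=1000 + i) for i in range(500)]
job_ids = list(range(0, 1000, 2))
assert join_with_scan(job_ids, salaries) == join_with_dict(job_ids, salaries)
```

In a trace, the first version shows up as one suspiciously long span that grows with the size of the data, which is the kind of hint that pointed us to the loop.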
19. How did Distributed Tracing help us?
Case 2
Context
Google uses various metrics to understand how user-friendly a site is. Better metrics mean better rankings in the search engine. One of the metrics focuses on site performance: LCP (Largest Contentful Paint), a measure of load speed.
Problem
How can we improve our load speed?
Tracing provides us a new tool:
Understand the full picture of what is going on in the backend by inspecting the trace of a user request
20. How did Distributed Tracing help us?
Case 2
Problem
How can we improve our load speed?
We found
● Multiple requests for the same data
● Duplicate requests from frontend Server Side Rendering (SSR)
● Serial requests where they can be made concurrently
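The findings above translate into two simple fixes: request each piece of data once, and issue independent requests at the same time. A minimal Python sketch of the idea, assuming a hypothetical `CountingBackend` that stands in for a downstream service (our real services are .NET):

```python
import asyncio


class CountingBackend:
    """Stand-in for a downstream service; counts calls and simulates latency."""

    def __init__(self, delay=0.05):
        self.calls = 0
        self.delay = delay

    async def fetch(self, key):
        self.calls += 1
        await asyncio.sleep(self.delay)
        return f"data-for-{key}"


async def load_serial(backend, keys):
    # One request after another, duplicates included.
    return [await backend.fetch(k) for k in keys]


async def load_concurrent(backend, keys):
    # Request each distinct key once, all at the same time.
    unique = list(dict.fromkeys(keys))
    values = await asyncio.gather(*(backend.fetch(k) for k in unique))
    by_key = dict(zip(unique, values))
    return [by_key[k] for k in keys]


keys = ["jobs", "salary", "jobs", "regions"]
serial_backend, concurrent_backend = CountingBackend(), CountingBackend()
serial_result = asyncio.run(load_serial(serial_backend, keys))
concurrent_result = asyncio.run(load_concurrent(concurrent_backend, keys))
```

Both functions return the same data, but the second makes three calls instead of four and waits roughly one delay instead of four. In a trace view, the first looks like a staircase of spans and the second like a stack of parallel ones.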
21. OpenTelemetry as the backbone of Distributed Tracing
OpenTelemetry is an observability framework: a collection of tools, SDKs, documentation etc. for all things observability
I find that it provides
● Ubiquitous language for the tracing concepts
● Standardized ways to collect, send and sample traces
● Interoperable implementation in various tech stacks
23. Distributed Tracing primitives: Trace
Records the paths taken by requests propagated through multi-service architectures
In .NET: a collection of Activities with the same TraceId. The TraceId is usually randomly generated, or passed from the parent operation
24. Distributed Tracing primitives: Attribute
Attributes are key-value pairs that contain metadata you can attach to a Span
In .NET, Attributes are represented as Activity Tags. You may attach any tag at any point in the lifetime of the Activity
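The relationship between these primitives can be sketched as a tiny data model. This is an illustration in Python, not the .NET `Activity` API; `Span`, `start_trace` and the attribute key are names I made up for the example.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)

    def child(self, name):
        # A child operation inherits the TraceId and records its parent.
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)


def start_trace(name):
    # A root operation has no parent, so it generates a random 16-byte TraceId.
    return Span(name=name, trace_id=secrets.token_hex(16))


root = start_trace("GET /salary")
db = root.child("db call")
db.attributes["db.name"] = "db 1"  # attributes can be attached at any point
other = start_trace("GET /jobs")

# A trace is simply every span that shares the same TraceId.
spans = [root, db, other]
trace = [s for s in spans if s.trace_id == root.trace_id]
```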
27. Distributed Tracing primitives
Very often, you don’t use those primitives yourself: it is already done for many popular libraries, such as SqlClient, Redis, Hangfire, PostgreSQL etc. It is very easy to add these libraries to your tracing setup
28. How to propagate Trace information?
version - trace ID - parent ID - trace flags (isSampled)
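The four fields above are what the W3C Trace Context `traceparent` header carries, separated by dashes. Below is a minimal Python sketch of parsing and formatting it. It only handles the version `00` layout, `parse_traceparent` and `TraceParent` are hypothetical helper names, and the sample value is the example from the W3C specification.

```python
import re
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: int

    @property
    def is_sampled(self):
        # The lowest bit of the trace flags is the "sampled" flag.
        return bool(self.flags & 0x01)

    def to_header(self):
        return f"{self.version}-{self.trace_id}-{self.parent_id}-{self.flags:02x}"


def parse_traceparent(value) -> Optional[TraceParent]:
    m = _TRACEPARENT.fullmatch(value.strip())
    if m is None:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, int(flags, 16))


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(header)
```

A service that receives a malformed header (here, `None` from the parser) would typically start a fresh trace rather than fail the request.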
29. How to propagate Trace information? HTTP calls
[Diagram: service 1 → service 2]
HEADERS:
traceparent: 00-xx-xx-01
30. How to propagate Trace information? HTTP calls
Enabling propagation in .NET
● Client side: OpenTelemetry.Instrumentation.Http
● Server side: OpenTelemetry.Instrumentation.AspNetCore
31. How to propagate Trace information? Messaging
Examples: Amazon SQS, Google Cloud Pub/Sub, RabbitMQ
It is important to make tracing easy to add, so it is best to provide a way to add tracing for the most used library.
For messaging, we at Jooble use RabbitMQ via the EasyNetQ library and, at the time of writing, there is no built-in support for OpenTelemetry tracing.
What can we do?
32. How to propagate Trace information? Messaging
Find extension points
33. How to propagate Trace information? Messaging
Wrap the library methods with your own instrumentation
34. How to propagate Trace information? Messaging
Extract into a library
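The wrapping step can be sketched as follows. This is a Python illustration of the idea, not the EasyNetQ API or our actual library: `InMemoryBus` is a hypothetical stand-in for the messaging client, and `TracedBus`, `child_of` and `current_traceparent` are names made up for the example. The wrapper puts the current trace context into the message headers on publish and restores it around the handler on consume.

```python
import secrets
from contextvars import ContextVar

# Trace context of the operation currently running (None when there is none).
current_traceparent = ContextVar("current_traceparent", default=None)


class InMemoryBus:
    """Hypothetical messaging client standing in for a real library."""

    def __init__(self):
        self.handlers = []

    def subscribe(self, handler):
        self.handlers.append(handler)

    def publish(self, body, headers):
        for handler in self.handlers:
            handler(body, headers)


def child_of(traceparent):
    # Same trace ID and flags, new span ID for the consuming operation.
    version, trace_id, _parent_id, flags = traceparent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


class TracedBus:
    """Wraps the client so callers get propagation without extra code."""

    def __init__(self, inner):
        self._inner = inner

    def publish(self, body, headers=None):
        headers = dict(headers or {})
        parent = current_traceparent.get()
        if parent is not None:
            headers["traceparent"] = parent  # inject on the producer side
        self._inner.publish(body, headers)

    def subscribe(self, handler):
        def traced_handler(body, headers):
            incoming = headers.get("traceparent")  # extract on the consumer side
            token = current_traceparent.set(child_of(incoming) if incoming else None)
            try:
                handler(body)
            finally:
                current_traceparent.reset(token)

        self._inner.subscribe(traced_handler)


seen = []
bus = TracedBus(InMemoryBus())
bus.subscribe(lambda body: seen.append((body, current_traceparent.get())))
current_traceparent.set("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
bus.publish("job-created")
```

Once the wrapper lives in a shared library, every team gets the same behaviour by swapping the client for the traced one.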
35. How to propagate Trace information? In-process background workers
Imagine you offload some processing to background threads. How do you preserve the trace information?
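One general answer is to capture the ambient context at the moment the work is handed off and to run the work inside that captured context. A minimal Python sketch using the standard `contextvars` module, given as an analogue rather than the .NET mechanism; `current_trace_id` and `submit_with_trace` are hypothetical names.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def process_in_background():
    # Runs on a worker thread; reads whatever trace ID is visible there.
    return current_trace_id.get()


def submit_with_trace(executor, fn, *args):
    # Snapshot the caller's context now and run the work inside that snapshot.
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)


with ThreadPoolExecutor(max_workers=1) as executor:
    current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    traced = submit_with_trace(executor, process_in_background).result()
```

The same idea applies to in-memory queues: store the trace context next to the queued item, and restore it when a worker picks the item up.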
36. Sampling as a source of confusion
Storing all traces may be infeasible due to storage concerns.
Sampling is a strategy for choosing which traces to store and which to drop.
version - trace ID - parent ID - trace flags (isSampled)
Choose one strategy and stick to it:
● Head sampling (decide on sampling when the trace is started)
● Tail sampling (decide on sampling after the trace is done)
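Head sampling is often implemented as a deterministic function of the trace ID, so that every service arrives at the same decision for the same trace. A minimal Python sketch of a ratio-based decision; this is an illustration, not the OpenTelemetry SDK's sampler, and `should_sample` and `trace_flags` are hypothetical names.

```python
import random


def should_sample(trace_id_hex, ratio):
    """Head sampling: decide once, at trace start, from the trace ID alone."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 8 bytes of the trace ID as a uniformly distributed number.
    bucket = int(trace_id_hex[-16:], 16)
    return bucket < ratio * (1 << 64)


def trace_flags(sampled):
    # The decision travels to downstream services in the trace flags field.
    return "01" if sampled else "00"


# With random trace IDs, roughly `ratio` of all traces end up sampled.
rng = random.Random(42)
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
sampled_share = sum(should_sample(t, 0.1) for t in trace_ids) / len(trace_ids)
```

Tail sampling cannot be expressed this way: it needs the finished trace (for example its duration or error status), so it has to run somewhere that sees all spans of a trace.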
37. Sampling as a source of confusion
Tip 1: Understand and communicate your sampling strategy
● What service is the first to start the trace and decide on whether to record the trace?
● What proportion of traces are sampled?
● How to change the number of traces to be sampled?
● How to understand if the trace was sampled?
Tip 2: Give a way to force a sampling decision
Tip 3: Use a much more lenient sampling strategy on test environments (e.g. record all traces)
38. Choosing a Tracing backend
[Diagram: services → traces collection → traces storage + querying]
Choosing a tracing backend is an architectural decision. Delaying architectural decisions is a useful skill.
Probability of change: tracing backend >>> collection mechanism
Decouple services from the tracing backend via the OpenTelemetry Collector: middleware that can redirect traces to one or more tracing backends of choice.
39. Choosing a Tracing backend
[Diagram: services → traces collection → traces storage + querying]
If you couple your apps to the collector, you can
● Change the tracing backend and/or visualization tool without changing app code
● (Optional) Try out several backends at the same time
● (Optional) Configure tail sampling or other processors that mutate traces before they go to the backend
40. Choosing a Tracing backend
Look into the capabilities and limitations of your organisation to decide on a tracing backend
We at Jooble wanted
● To utilise our own storage and compute capabilities (why pay for things that we already have in our datacenters)
● Interoperation with other observability tools that we already use
● Performance that can handle our traffic
41. Choosing a Tracing backend
We chose Grafana Tempo
● It can store traces on your own disks
● It can be deployed to your hardware
● The visualisation of traces is built into Grafana
● Claimed performance characteristics satisfied our needs
42. Choosing a Tracing backend
It proved to be a good choice because it went even further.
Grafana Tempo now has a query language, TraceQL, that can query traces based on different characteristics of the spans in them
E.g. it allows us to find
● Duplicate requests to a service
● DB calls that are over a certain threshold
● Traces that trigger a certain bug we investigate
43. Problems with choosing a Cutting Edge solution - a story
At some point after upgrading to Tempo 2.0, search requests started to look like this
● Search request 1: 0.5s
● Search request 2: Bad Gateway
● Search request 3: 10.5s
● Search request 4: 0.5s
44. Problems with choosing a Cutting Edge solution - a story
What’s going on?
Error logs showed the following:
error clearing completing block: unlinkat /var/tempo/wal/{folderName}: directory not empty
After investigating the Tempo code (a very nice feature is that Tempo is open source), it became clear that it performs a recursive delete on the folder.
45. Problems with choosing a Cutting Edge solution - a story
The culprit: the NFS (Network File System) silly rename mechanism.
If a file is still open, the delete operation does not delete it but just renames it, postponing the actual delete until the file is closed.
46. Problems with choosing a Cutting Edge solution - a story
Failed folder delete operations disrupted Tempo’s storage optimisation mechanism, which then wreaked havoc on search performance.
I filed a Pull Request that closes all files opened by Tempo, and this fixed things. Big thanks to the Tempo team for responding to my questions and helping get the fix over the finish line.
47. Try tracing yourself
● .NET BCL provides all the required primitives
● Variety of tracing backends and visualisation tools: Grafana Tempo, Jaeger, Zipkin, Honeycomb etc.
● Support for many popular libraries and frameworks is there
● Introduction can be incremental (one service at a time)
48. How to contact me?
kostiantyn@sharovarskyi.com
k_sharovarskyi
Check out my website, where I sometimes publish blog posts: sharovarskyi.com