"Introducing Distributed Tracing in a Large Software System", Kostiantyn Sharovarskyi

  1. Introducing Distributed Tracing in a Large Software System
  2. Kostiantyn Sharovarskyi. Software engineer specializing in .NET. I work on systems at Jooble, a job aggregator that helps people find jobs.
  3. Why should I care about Distributed Tracing?
  4. How does [use-case] work? I want to understand how a use case works.
  5. How does [use-case] work? I find the service codebase. [Diagram: a service.]
  6. How does [use-case] work? I find an interesting slice of the code. [Diagram: a service.]
  7. How does [use-case] work? I find that the use case contains a db call. [Diagram: service, db call to db 1.]
  8. How does [use-case] work? I find that the use case contains a service call. Now I need to go to another service to see what it does. Repeat… [Diagram: service, db call to db 1, HTTP call to another service.]
  9. How does [use-case] work? The Distributed Tracing alternative: I find the identifier (trace ID) of a certain request in PROD.
  10. How does [use-case] work? The Distributed Tracing alternative: I can see the whole picture. [Diagram: service, service 2, db 1 and db 2, connected by db calls, an HTTP call and message passing.]
  11. How does [use-case] work? The Distributed Tracing alternative: I can see the whole picture with all the details of this exact process in action in PROD. [Diagram: the same picture, annotated with span details: (host: SERVER 1, endpoint: /endpoint, time: 1s, userID: 999); (host: SERVER 4, time: 0.25s); (host: SERVER 3, time: 0.25s, queue: queue_1); (host: SERVER 2, endpoint: /endpoint2, time: 0.5s).]
  12. What is Distributed Tracing? Tracking application requests as they flow between services, messaging systems, databases, etc. One can call it debugging for distributed systems.
  13. About the company. Jooble is the No. 2 most popular job search aggregator according to SimilarWeb: 1 billion visits annually, 140K resources, 69 countries, 25 languages, founded in 2006. Jooble’s mission is to help people find work.
  14. How did Distributed Tracing help us? Case 1
  15. How did Distributed Tracing help us? Case 1. Problem: we have a salary page that shows salary information for various jobs and regions. For some jobs in some countries, this page showed a 404 page. How to find the root issue? Traditionally, we can go to the code, read it and create hypotheses about why this could happen, or add logs and keep adding more until we understand what is going on. Or we can look at the trace to see where the root of the problem is.
  16. How did Distributed Tracing help us? Case 1. Problem: 404 response from the frontend-serving service. Step: look at the trace of the call. A timeout from an underlying service led the service to believe that there was no data, hence the 404.
  17. How did Distributed Tracing help us? Case 1. Problem: 404 response from the frontend-serving service. Step: add more operations to the trace to pinpoint the culprit.
  18. How did Distributed Tracing help us? Case 1. Problem: 404 response from the frontend-serving service. Step: zoom into the code and fix the problem. Profit! A hot loop was doing a lookup with LINQ, which made it O(n^2), instead of a dictionary-based lookup, which makes it O(n).
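
The kind of fix described in Case 1 can be sketched as follows. This is a minimal illustration in Python rather than the talk's .NET code; the record shapes (`jobs`, `salaries`) and function names are invented for the example and are not taken from the Jooble codebase.

```python
def join_with_scan(jobs, salaries):
    """O(n*m): for every job, scan the whole salary list.

    This is what a linear search inside a hot loop amounts to.
    """
    result = []
    for job in jobs:
        match = next((s for s in salaries if s["job_id"] == job["id"]), None)
        result.append((job["title"], match["amount"] if match else None))
    return result


def join_with_dict(jobs, salaries):
    """O(n+m): build an index once, then each lookup is O(1) on average."""
    by_job_id = {}
    for s in salaries:
        by_job_id.setdefault(s["job_id"], s)  # keep the first match, like the scan
    return [
        (job["title"], by_job_id[job["id"]]["amount"] if job["id"] in by_job_id else None)
        for job in jobs
    ]


# Hypothetical sample data.
jobs = [{"id": i, "title": f"job-{i}"} for i in range(5)]
salaries = [{"job_id": i, "amount": 1000 * i} for i in range(0, 5, 2)]
```

Both functions return the same rows; only the cost of each lookup differs, which is what matters once the loop runs over thousands of items.
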
  19. How did Distributed Tracing help us? Case 2. Context: Google uses various metrics to understand how user-friendly a site is. Better metrics mean better rankings in the search engine. One of the metrics, LCP (Largest Contentful Paint), focuses on site performance, that is, load speed. Problem: how can we improve our load speed? Tracing provides us a new tool: understand the full picture of what is going on in the backend by inspecting the trace of a user request.
  20. How did Distributed Tracing help us? Case 2. Problem: how can we improve our load speed? We found: multiple requests for the same data; duplicate requests from frontend Server Side Rendering (SSR); serial requests where they could be made concurrently.
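
Two of the findings in Case 2 (repeated requests for the same data, and serial requests that could run concurrently) can be illustrated with a small sketch. It uses Python's `asyncio`; `fetch` is a hypothetical stand-in for a backend call, and the names are assumptions made for the example.

```python
import asyncio


async def fetch(name, delay=0.05):
    # Stand-in for a backend call; a real one would be an HTTP or DB request.
    await asyncio.sleep(delay)
    return f"{name}-data"


async def load_serial(names):
    # Each call waits for the previous one, so the total time is
    # roughly the sum of all delays, duplicates included.
    return [await fetch(n) for n in names]


async def load_concurrent(names):
    # Request each distinct name once and run the requests at the same time.
    unique = list(dict.fromkeys(names))  # de-duplicate, keep order
    results = await asyncio.gather(*(fetch(n) for n in unique))
    by_name = dict(zip(unique, results))
    return [by_name[n] for n in names]


names = ["salary", "region", "salary"]
serial = asyncio.run(load_serial(names))
concurrent = asyncio.run(load_concurrent(names))
```

The two loaders return identical data; the concurrent one issues two requests instead of three and waits for roughly the slowest one rather than for all of them in turn.
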
  21. OpenTelemetry as the backbone of Distributed Tracing. OpenTelemetry is an observability framework: a collection of tools, SDKs, documentation, etc. for all things observability. I find that it provides: a ubiquitous language for tracing concepts; standardized ways to collect, send and sample traces; interoperable implementations in various tech stacks.
  22. Distributed Tracing primitives: Span. Represents an operation. In .NET, it is implemented via the Activity class.
  23. Distributed Tracing primitives: Trace. Records the paths taken by requests propagated through multi-service architectures. In .NET, it is the collection of Activities with the same TraceId. The TraceId is usually randomly generated or passed from the parent operation.
  24. Distributed Tracing primitives: Attribute. Attributes are key-value pairs that contain metadata you can attach to a Span. In .NET, Attributes are represented as Activity Tags. You may attach any tag at any point in the lifetime of the Activity.
  25. Distributed Tracing primitives. [Diagram: service 1 makes an HTTP call to service 2.]
  26. Distributed Tracing primitives. [Diagram: one trace containing a call span (service: service, attr1: value, attr2: value) and a receive span (service: service2, attr1: value, attr2: value).]
  27. Distributed Tracing primitives. Very often, you don’t use those primitives yourself: instrumentation already exists for many popular libraries, such as SqlClient, Redis, Hangfire, PostgreSQL, etc. It is very easy to add these libraries to your tracing setup.
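
To make the relationship between the three primitives concrete, here is a minimal data-model sketch. It is written in Python for illustration and is not the .NET `Activity` API; the `Span` class and `start_span` helper are hypothetical names.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One operation. A trace is simply every span sharing a trace_id."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)  # key-value metadata


def start_span(name, parent=None):
    # A root span generates a random trace ID; a child inherits its parent's.
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    return Span(name, trace_id, parent_id=parent.span_id if parent else None)


root = start_span("GET /endpoint")
root.attributes["userID"] = 999          # attributes can be added at any time
child = start_span("db call", parent=root)
trace = [s for s in (root, child) if s.trace_id == root.trace_id]
```
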
  28. How to propagate Trace information? The propagated fields: version, trace ID, parent ID, trace flags (isSampled).
  29. How to propagate Trace information? HTTP calls. [Diagram: service 1 calls service 2 with HEADERS: traceparent: 00-xx-xx-01.]
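
The `traceparent` header shown above follows the W3C Trace Context format: four dash-separated, lowercase hex fields (2-character version, 32-character trace ID, 16-character parent ID, 2-character flags), where the lowest bit of the flags is the sampled flag. The following Python sketch parses and formats version `00` headers only; the function and class names are assumptions made for the example, and the sample header value is the one used in the W3C specification.

```python
import re
from typing import NamedTuple, Optional

# Version 00 only: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: int

    @property
    def is_sampled(self) -> bool:
        return bool(self.flags & 0x01)  # lowest bit is the sampled flag


def parse_traceparent(header: str) -> Optional[TraceParent]:
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, int(flags, 16))


def format_traceparent(tp: TraceParent) -> str:
    return f"{tp.version}-{tp.trace_id}-{tp.parent_id}-{tp.flags:02x}"


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(header)
```

In practice the instrumentation libraries do this for you; the sketch only shows what travels in the header.
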
  30. How to propagate Trace information? HTTP calls. Enabling propagation in .NET: on the client side, OpenTelemetry.Instrumentation.Http; on the server side, OpenTelemetry.Instrumentation.AspNetCore.
  31. How to propagate Trace information? Messaging (Amazon SQS, Google Cloud Pub/Sub, RabbitMQ). It is important to make tracing easy to add, so it is best to provide a way to add tracing for the most used library. For messaging, we at Jooble use RabbitMQ via the EasyNetQ library, and at the time of writing there is no built-in support for OpenTelemetry tracing. What can we do?
  32. How to propagate Trace information? Messaging. Find extension points.
  33. How to propagate Trace information? Messaging. Wrap the library methods with your own instrumentation.
  34. How to propagate Trace information? Messaging. Extract into a library.
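
The three messaging steps above can be sketched as a wrapper that injects the trace context into message headers on publish and extracts it on consume. This is an illustrative Python sketch, not the EasyNetQ API: `InMemoryBus` is a hypothetical stand-in for a message library without tracing support, and `TracedBus` is the hypothetical wrapper you would extract into a shared library.

```python
import secrets
from collections import deque


class InMemoryBus:
    """Hypothetical stand-in for a message library without tracing support."""

    def __init__(self):
        self._queue = deque()

    def publish(self, body, headers=None):
        self._queue.append((body, dict(headers or {})))

    def consume(self):
        return self._queue.popleft()


class TracedBus:
    """Wraps the library's own methods and adds trace propagation."""

    def __init__(self, inner):
        self._inner = inner

    def publish(self, body, trace_id, span_id, sampled=True):
        flags = "01" if sampled else "00"
        headers = {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
        self._inner.publish(body, headers)

    def consume(self):
        body, headers = self._inner.consume()
        _version, trace_id, parent_id, flags = headers["traceparent"].split("-")
        # The consumer's span joins the producer's trace as a child span.
        span = {
            "trace_id": trace_id,
            "parent_id": parent_id,
            "span_id": secrets.token_hex(8),
            "sampled": flags == "01",
        }
        return body, span


bus = TracedBus(InMemoryBus())
bus.publish("hello", trace_id="ab" * 16, span_id="cd" * 8)
body, span = bus.consume()
```

Because application code only talks to the wrapper, every service that adopts it gets propagation over the queue without touching its own publish and consume logic.
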
  35. How to propagate Trace information? In-process background workers. Imagine you offload some processing to background threads. How do you preserve the trace information?
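
One common answer is to capture the ambient trace context when the work is scheduled and restore it inside the worker. The sketch below shows the idea with Python's `contextvars` and a thread pool; it is an analogy, not the .NET mechanism from the talk, and `current_trace_id` and `submit_with_context` are hypothetical names.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Ambient "current trace" value, similar in spirit to a current-span slot.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def do_work():
    # Runs on a worker thread and reports which trace it believes it is in.
    return current_trace_id.get()


def submit_with_context(executor, fn, *args):
    # Capture the caller's context now and run the function inside it later.
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)


current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
with ThreadPoolExecutor(max_workers=1) as pool:
    # A plain submit does not carry the caller's context explicitly; on
    # standard CPython builds the worker typically sees the default here.
    without_ctx = pool.submit(do_work).result()
    with_ctx = submit_with_context(pool, do_work).result()
```

With the captured context, the background work reports the same trace ID as the request that scheduled it, so its spans end up in the same trace.
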
  36. Sampling as a source of confusion. Storing all traces may be infeasible due to storage concerns. Sampling is a strategy for choosing which traces to store and which to drop. [Diagram: version, trace ID, parent ID, trace flags (isSampled).] Choose one strategy and stick to it: head sampling (decide on sampling when the trace is started) or tail sampling (decide on sampling after the trace is done).
  37. Sampling as a source of confusion. Tip 1: understand and communicate your sampling strategy. ● What service is the first to start the trace and decide whether to record it? ● What proportion of traces is sampled? ● How to change the number of traces to be sampled? ● How to tell whether a trace was sampled? Tip 2: give a way to force a sampling decision. Tip 3: use a much more lenient sampling strategy on test environments (e.g. record all traces).
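
A head-sampling decision with a forced override (tip 2) and a configurable ratio (tip 3) can be sketched as below. This is an illustrative Python sketch under stated assumptions: the function names are hypothetical, and the ratio check on the trace ID is one simple deterministic scheme, not a description of any particular SDK's sampler.

```python
def head_sample(trace_id, ratio, force=False):
    """Decide at trace start whether to record the trace.

    The decision is derived from the trace ID, so every service that
    evaluates the same ID with the same ratio reaches the same answer.
    """
    if force:
        return True  # e.g. a debug switch set by the caller
    bucket = int(trace_id[-16:], 16)  # low 64 bits of the random trace ID
    return bucket < ratio * 2**64


def decide(trace_id, ratio, incoming_flags=None, force=False):
    # Downstream services follow the decision made by the first service,
    # which arrives in the trace flags (isSampled) field.
    if incoming_flags is not None and not force:
        return bool(incoming_flags & 0x01)
    return head_sample(trace_id, ratio, force)


def trace_flags(sampled):
    return "01" if sampled else "00"


PROD_RATIO = 0.1   # hypothetical: keep about 10% of traces in production
TEST_RATIO = 1.0   # record everything on test environments
```

Keeping the ratio and the force switch in one place also makes the four questions from tip 1 easy to answer for anyone reading the code.
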
  38. Choosing a Tracing backend. [Diagram: services, traces collection, traces storage + querying.] Choosing a tracing backend is an architectural decision. Delaying architectural decisions is a useful skill. The tracing backend is much more likely to change than the collection mechanism. Decouple services from the tracing backend via the OpenTelemetry Collector, middleware that can redirect traces to one or more tracing backends of your choice.
  39. Choosing a Tracing backend. If you couple your apps to the collector, you can: ● Change the tracing backend and/or visualization tool without changing app code ● (Optional) Try out several backends at the same time ● (Optional) Configure tail sampling or other processors that mutate traces before they go to the backend. [Diagram: services, traces collection, traces storage + querying.]
  40. Choosing a Tracing backend. Look into the capabilities and limitations of your organisation to decide on a tracing backend. We at Jooble wanted: to utilise our own storage and compute capabilities (why pay for things that we already have in our datacenters?); interoperation with other observability tools that we already use; performance that can handle our traffic.
  41. Choosing a Tracing backend. We chose Grafana Tempo: it can store traces on your own disks; it can be deployed to your hardware; the visualisation of traces is built into Grafana; its claimed performance characteristics satisfied our needs.
  42. Choosing a Tracing backend. It proved to be a good choice because it went even further. Grafana Tempo now has a query language, TraceQL, that can query traces based on different characteristics of the spans in them. E.g. it allows us to find: duplicate requests to a service; DB calls that are over a certain threshold; traces that trigger a certain bug we are investigating.
  43. Problems with choosing a Cutting Edge solution - a story. At some point after upgrading to Tempo 2.0, search requests started to look like this: ● Search request 1: 0.5s ● Search request 2: Bad Gateway ● Search request 3: 10.5s ● Search request 4: 0.5s
  44. Problems with choosing a Cutting Edge solution - a story. Error logs showed the following: error clearing completing block: unlinkat /var/tempo/wal/{folderName}: directory not empty. What’s going on? After investigating the Tempo code (a very nice feature is that Tempo is open source), it is clear that it is performing a recursive delete on the folder.
  45. Problems with choosing a Cutting Edge solution - a story. The culprit: the NFS (Network File System) server-side silly rename mechanism. If a file is open on the server, the delete operation does not delete it but just renames it, postponing the actual deletion until the file is closed.
  46. Problems with choosing a Cutting Edge solution - a story. Failed folder delete operations disrupted Tempo’s storage optimisation mechanism, which then wreaked havoc on search performance. I filed a Pull Request that closes all files opened by Tempo, and this fixed things. Big thanks to the Tempo team for responding to my questions and helping get the fix to the finish line.
  47. Try tracing yourself. The .NET BCL provides all the required primitives. There is a variety of tracing backends and visualisation tools: Grafana Tempo, Jaeger, Zipkin, Honeycomb, etc. Support for many popular libraries and frameworks is there. Introduction can be incremental (one service at a time).
  48. How to contact me? kostiantyn@sharovarskyi.com, k_sharovarskyi. Check out my website, where I sometimes post blog posts: sharovarskyi.com
  49. We are hiring! Explore open positions on our website.
  50. Thank you! Questions?