Distributed tracing - get a grasp on your production



Slides from my presentation on distributed tracing, explaining what latency is and why it matters. We took a look at OpenZipkin and its concepts, such as how the core annotations work and what tags and logs are. This was followed by a demo application built with Go and Java (Spring Boot, Spring Cloud Sleuth Zipkin). You can find source code here

Published in: Technology

Distributed tracing - get a grasp on your production

  1. @nklmish Distributed tracing - get a grasp on your production: “the most wanted and missed tool in the microservice world”
  2. Agenda: Why latency?; Distributed tracing; Short demo; Zipkin & core concepts; Code walkthrough
  3. Latency
  4. Every little bit counts
  5. With scale, you see (source:
  6. Latency?
  7. User waiting
  8. Remember, slow pages lose users
  9. Distributed systems - latency analysis
  10. Story time: how Bob met longtail latency
  11. Bob didn’t know he was suffering from longtail latency
  12. Bob trying to troubleshoot longtail latency in a distributed system
  13. Option 1: Log analysis
  14. Lots of files
  15. Looking in logs
  16. Not everything is in the critical path.
  17. Correlating logs is manual work
  18. It simply doesn’t make sense
  19. Option 2: What about metrics? (source:
  20. Something is wrong (source:
  21. Can’t tell the cause (source: ?
  22. Aggregates (avg, stdev) may deceive (source:
  23. Bob, could we find out how many clients are impacted?
  24. Bob learns about percentiles
  25. Clients impacted by longtail latency…
      N: no. of visits (500,000)
      D: delay (50 ms)
      S: highly active service suffering from longtail latency (99% fast & 1% slow)
      Percentile: 99th => 1 out of 100 visits experiences D
      Total visits experiencing delay: N ÷ 100 => 5,000
      1 visit (in our distributed system): 8 downstream calls => interacting with S
      1 visit encountering latency: 1-(0.99^8) = 1-0.923 => 0.077 ≈ 8% (likelihood)
      Total visits affected: 8%N => 40,000
      Impacts: a. Lots of visits b. Repeated visits in a day
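The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `affected_visits` is made up for illustration, and it assumes that the 8 downstream calls are independent and that each one hits the slow path with probability 1%.

```python
def affected_visits(total_visits, calls_per_visit, slow_fraction):
    """Return (probability a visit sees at least one slow call, expected affected visits)."""
    # A visit is fast only if every downstream call is fast.
    p_all_fast = (1 - slow_fraction) ** calls_per_visit
    p_slow_visit = 1 - p_all_fast
    return p_slow_visit, round(total_visits * p_slow_visit)


# Numbers from the slide: N = 500,000 visits, 8 downstream calls, 1% slow.
probability, visits = affected_visits(500_000, 8, 0.01)
print(f"likelihood per visit: {probability:.3f}, affected visits: {visits}")
```

The slide rounds the likelihood up to 8% before multiplying, which gives 40,000; without that rounding the estimate is a little under 39,000 visits.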
  26. Boss needs a solution
  27. But we still don’t know…: request timeline (when it started & which operation); log correlation; how the same operation behaved across different clusters/regions/zones; how much it deviates from an acceptable value; call graph
  28. Bob was missing distributed tracing
  29. Distributed tracing: tracks request flow; fast reaction (traced data available within minutes); dynamically instruments apps; system insight, critical path, understanding call graphs (which services, which operations, at what time, etc.); measuring E2E latency; call patterns (optimisation) & bug discovery (spotting redundant requests, sync vs async)
  30. How can we apply this knowledge?
  31. Via a tracing system. A tracing system should: trace; have low overhead; be scalable; work 24 * 7 * 365 (production bugs are difficult to reproduce). It shouldn’t: rely on programmers’ collaboration
  32. OpenZipkin - open source tracing system
  33. OpenZipkin. Zipkin is: a distributed tracing system; created by Twitter; based on Dapper. OpenZipkin: GitHub organisation; primary fork of Zipkin; open source; pluggable architecture
  34. Span: denotes a logical unit of work done (timestamped); work done is expressed as a human-readable string (operation name); created by the tracer (instrumenting code); slim (KiB or less); root span: a span without a parent id
  35. Zipkin annotations (client/server diagram)
      cs: client send
      sr: server received
      ss: server send
      cr: client received
      HTTP Request: get catalog (span starts)
      HTTP Response: catalog (span ends)
      Processing time = ss - sr
      Response time = cr - cs
      Network latency = sr - cs (request) and cr - ss (response)
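The four core annotations give the latency breakdown directly. The following sketch applies the formulas from this slide; the `RpcTimestamps` class, the `breakdown` function, and the millisecond values are illustrative assumptions, not part of Zipkin's API.

```python
from dataclasses import dataclass


@dataclass
class RpcTimestamps:
    """Timestamps (in ms) of the four core annotations for one RPC."""
    cs: int  # client send
    sr: int  # server received
    ss: int  # server send
    cr: int  # client received


def breakdown(t):
    """Apply the slide's formulas to one set of timestamps."""
    return {
        "response_time": t.cr - t.cs,      # what the client experiences
        "processing_time": t.ss - t.sr,    # time spent inside the server
        "network_request": t.sr - t.cs,    # client -> server
        "network_response": t.cr - t.ss,   # server -> client
    }


# Hypothetical values: request sent at 0 ms, answered 52 ms later.
print(breakdown(RpcTimestamps(cs=0, sr=5, ss=45, cr=52)))
```

Response time always equals processing time plus the two network legs, which is a handy sanity check on recorded data.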
  36. It’s all about trace & span (diagram labels: Span, Trace)
      HTTP Request: get catalog
      CatalogService: getCatalog() (traceId: 1, parentId: , spanId: 1)
        PriceService: getPrice() (traceId: 1, parentId: 1, spanId: 2)
        ProductService: getProducts() (traceId: 1, parentId: 1, spanId: 3)
          Database call (traceId: 1, parentId: 3, spanId: 4)
          Data analytic call (traceId: 1, parentId: 3, spanId: 5)
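The parent ids are all that is needed to rebuild the call tree from a flat list of spans. Below is a minimal sketch using the ids from this slide; the dictionary layout and the `render_trace` helper are assumptions made for illustration and do not reflect Zipkin's storage format.

```python
from collections import defaultdict

# Spans from the slide; the root span has no parent id.
SPANS = [
    {"trace_id": 1, "parent_id": None, "span_id": 1, "name": "CatalogService: getCatalog()"},
    {"trace_id": 1, "parent_id": 1, "span_id": 2, "name": "PriceService: getPrice()"},
    {"trace_id": 1, "parent_id": 1, "span_id": 3, "name": "ProductService: getProducts()"},
    {"trace_id": 1, "parent_id": 3, "span_id": 4, "name": "Database call"},
    {"trace_id": 1, "parent_id": 3, "span_id": 5, "name": "Data analytic call"},
]


def render_trace(spans):
    """Return the spans as indented lines, children nested under their parent."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children[parent_id], key=lambda s: s["span_id"]):
            lines.append("  " * depth + span["name"])
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # start from the root span(s)
    return lines


print("\n".join(render_trace(SPANS)))
```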
  37. Trace (E2E latency graph): DAG of spans, forms a latency tree.
  38. Demo: distributed-tracing-demo
  39. Demo application - Zipkin visualises dependencies
  40. Zipkin’s architecture (diagram): instrumented service; transport (Scribe/Kafka): collect & convert spans; collector: receive spans, deserialising, sampling & scheduling for storage; storage (DB: Cassandra/MySQL/Elasticsearch): store spans; API: retrieves data; UI: visualise
  41. Tags. A tag denotes: a key-value pair; not timestamped. A span may contain zero or more tags.
  42. Log. A log denotes: an event name (marks a meaningful moment in the lifetime of a span); timestamped. A span may contain zero or more logs.
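The difference between tags and logs is easiest to see in a data structure. This is a toy sketch, not the API of Zipkin or any tracer; the `Span` class and its `set_tag` and `log` methods are hypothetical names.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Span:
    """Toy span: tags are plain key-value pairs, logs are timestamped events."""
    name: str
    tags: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)

    def set_tag(self, key, value):
        # Tags carry no timestamp; setting a key again overwrites the value.
        self.tags[key] = value

    def log(self, event, timestamp=None):
        # Logs mark a moment in the span's lifetime, so each gets a timestamp.
        when = time.time() if timestamp is None else timestamp
        self.logs.append((when, event))


span = Span("get catalog")
span.set_tag("http.path", "/catalog")
span.log("cache miss")
print(span.tags, span.logs)
```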
  43. Annotations: help explain latency with a timestamp. Annotations are often codes, e.g. sr, cs, etc.
  44. Binary annotations: tag a span with context, usually to support query or aggregation (e.g. http.path). Repeatable and vary on the host.
  45. Can I have large spans (e.g. MiB)? They decrease usability & increase the cost of the tracing system.
  46. Beware of clock skew!!! 10:00 10:00
  47. Beware of clock skew!!! 10:00:01 10:00:22
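Clock skew matters because some latency figures mix timestamps taken on two different hosts. The sketch below simulates this; the `observe` function, the offsets, and the millisecond values are assumptions chosen for illustration (the 21-second offset mirrors the gap between the two clocks on the slide).

```python
def read_clock(true_time_ms, clock_offset_ms):
    """What a host's clock shows when the real time is true_time_ms."""
    return true_time_ms + clock_offset_ms


def observe(cs_true, sr_true, ss_true, cr_true, client_offset_ms, server_offset_ms):
    """Latency breakdown as computed from each host's own (possibly skewed) clock."""
    cs = read_clock(cs_true, client_offset_ms)
    cr = read_clock(cr_true, client_offset_ms)
    sr = read_clock(sr_true, server_offset_ms)
    ss = read_clock(ss_true, server_offset_ms)
    return {
        "response_time": cr - cs,     # both timestamps from the client clock
        "processing_time": ss - sr,   # both timestamps from the server clock
        "network_request": sr - cs,   # mixes the two clocks
        "network_response": cr - ss,  # mixes the two clocks
    }


in_sync = observe(0, 5, 45, 52, client_offset_ms=0, server_offset_ms=0)
skewed = observe(0, 5, 45, 52, client_offset_ms=0, server_offset_ms=21_000)
print(in_sync)
print(skewed)
```

Durations measured on a single host (response time, processing time) are unaffected by the offset, while the cross-host network figures become meaningless and can even turn negative.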
  48. Tracer: does most of the heavy lifting, e.g. span creation, context generation, passing info, data propagation, etc.
  49. Sampling: controls how much to record. High-traffic systems: a fraction of traffic is enough. Low-traffic systems: adjust based on your needs. Note: debug spans are always recorded.
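A sampling decision of this kind can be sketched as a simple probability check. This is an illustrative stand-in, not Zipkin's or Sleuth's sampler; the `ProbabilitySampler` class and its `should_record` method are hypothetical names.

```python
import random


class ProbabilitySampler:
    """Record roughly `rate` of all traces; debug traces are always recorded."""

    def __init__(self, rate, rng=None):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.rate = rate
        # An injectable random source keeps the sampler testable.
        self.rng = rng if rng is not None else random.Random()

    def should_record(self, debug=False):
        if debug:
            return True  # debug spans bypass sampling
        return self.rng.random() < self.rate


# High-traffic system: keep about 1 in 10 traces.
sampler = ProbabilitySampler(0.1)
recorded = sum(sampler.should_record() for _ in range(10_000))
print(f"recorded {recorded} of 10000 traces")
```

The decision is normally made once, at the root span, and propagated downstream so that a trace is either recorded completely or not at all.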
  50. OpenTracing: standardises tracing; vendor-neutral tracing API; implementations available in 6 languages documentation/
  51. Spring Cloud Sleuth Zipkin: brings distributed tracing to Spring Cloud. Spring Cloud Starter Zipkin (Zipkin + Sleuth). Supports: Hystrix; async; RestTemplate; Feign; Zuul; Spring Integration; …
  52. Code walkthrough: distributed-tracing-demo
  53. Who uses tracing
  54. Zipkin & Prometheus
  55. Zipkin for…
  56. Summary: latency is never zero, embrace it
  57. Summary: distributed systems are hard to reason about, with complex call graphs; distributed tracing helps to analyse E2E latency & understand call graphs; instrumentation is tricky (async, thread pools, callbacks, etc.); OpenZipkin provides an open source tracing system that visualises request flow; Spring Cloud Sleuth brings tracing to the Spring world; OpenTracing aims to standardise tracing
  58. Thank you. Questions? => Review => Source Code