
Distributed Tracing, from internal SaaS insights


  1. Distributed Tracing: insights from the internal SaaS team
  2. Agenda > Background > Our internal DistTrace problems > Searching for a better solution > Conclusion
  3. Introduction > @dxhuy (フィ) > EM of LINE’s Observability Team
  4. Our team > Large-scale metrics platform (400M datapoints per min) > Logs platform (1M logs per min) > Distributed tracing platform (2M spans per min) https://speakerdeck.com/line_devday2019/deep-dive-into-lines-time-series-database-flash
  5. Our team > Large-scale metrics platform (400M datapoints per min) > Logs platform (1M logs per min) > Distributed tracing platform (2M spans per min)
  6. Prerequisite know-how > What is distributed tracing > Request-scoped DistTrace > Concepts
  7. Our setup > Brave-based client > Thrift transport > Custom Armeria-based collector (https://github.com/line/armeria) > Zipkin-based internal server > Elasticsearch on NVMe-based, high-spec machines (a minimal client-side sketch follows slide 8)
  8. Our setup > In-house customized Zipkin UI - now the upstream default (zipkin-lens)
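
The deck does not include code for the setup on slides 7-8, so the following is only a minimal, generic sketch of Brave reporting spans to a Zipkin-compatible collector. It uses the plain HTTP sender for simplicity, whereas the deck describes a Thrift transport and a custom Armeria-based collector; the endpoint, service name, and span name are placeholders, not LINE's actual configuration.

```java
import brave.Span;
import brave.Tracer;
import brave.Tracing;
import zipkin2.reporter.brave.AsyncZipkinSpanHandler;
import zipkin2.reporter.okhttp3.OkHttpSender;

public class TracingSetup {
  public static void main(String[] args) {
    // Placeholder collector endpoint; LINE's setup sends Thrift to a custom
    // Armeria-based collector rather than plain HTTP.
    OkHttpSender sender = OkHttpSender.create("http://zipkin-collector:9411/api/v2/spans");
    AsyncZipkinSpanHandler zipkin = AsyncZipkinSpanHandler.create(sender);

    Tracing tracing = Tracing.newBuilder()
        .localServiceName("example-service") // placeholder service name
        .addSpanHandler(zipkin)
        .build();
    Tracer tracer = tracing.tracer();

    // Record one span around a unit of work.
    Span span = tracer.nextSpan().name("example-operation").start();
    try {
      // ... application work ...
    } finally {
      span.finish();
    }

    tracing.close();
    zipkin.close();
    sender.close();
  }
}
```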
  9. Problems with OSS multi-tenant DistTracing > Storage cost + scalability > Standard war (instrumentation libraries) > UI / UX for multi-tenancy > User voice: high implementation cost, useful spans get sampled out
  10. Storage problem > Infra cost vs. useful data > If we sample 100% of the data, infra cost for tracing >> infra cost for the app
  11. Sampling problem > Sampling to reduce storage cost > All OSS employs “head-based” sampling (unbiased sampling) - Much useful data gets sampled out - What is useful data, anyway?
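
To make the head-based sampling point concrete: the keep/drop decision is made at the root of the trace, before anything is known about errors or latency, so interesting traces are dropped at the same rate as healthy ones. A minimal Brave-style illustration follows; the 1% rate and service name are only examples, not LINE's settings.

```java
import brave.Tracing;
import brave.sampler.Sampler;

public class HeadBasedSamplingExample {
  public static void main(String[] args) {
    // Head-based (unbiased) sampling: decided up front at the trace root,
    // before the request's outcome is known.
    Tracing tracing = Tracing.newBuilder()
        .localServiceName("example-service")
        .sampler(Sampler.create(0.01f)) // keep roughly 1% of traces
        .build();

    tracing.close();
  }
}
```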
  12. OSS UI / UX problem > Searching by “serviceName” is hard when you have 100 teams, each with 50 services > Time-range queries are mostly useless when you have 1000 rps
  13. Standard war > Zipkin B3 headers > OpenTracing > OpenCensus > OpenTelemetry > Language-based (Go, JVM) > Middleware-based (HTrace, ...)
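
As a concrete illustration of the standard war: the same trace context ends up encoded in incompatible header formats depending on the stack. The hex IDs below are made-up example values.

```java
import java.util.Map;

public class PropagationFormats {
  public static void main(String[] args) {
    // Zipkin B3 multi-header propagation (used by Brave/Zipkin).
    Map<String, String> b3 = Map.of(
        "X-B3-TraceId", "463ac35c9f6413ad48485a3953bb6124",
        "X-B3-SpanId", "a2fb4a1d1a96d312",
        "X-B3-Sampled", "1");

    // W3C Trace Context (the format OpenTelemetry settled on).
    Map<String, String> w3c = Map.of(
        "traceparent", "00-463ac35c9f6413ad48485a3953bb6124-a2fb4a1d1a96d312-01");

    System.out.println(b3);
    System.out.println(w3c);
  }
}
```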
  14. How could we make it better?
  15. Trace without trace > The Mystery Machine: End-to-end performance analysis of large-scale Internet services - Facebook’s first tracing solution (before Canopy) - Uses logs to compute causal analysis http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.478.3942&rep=rep1&type=pdf
  16. Trace without trace (cont) > Canopy: An End-to-End Performance Tracing And Analysis System (Facebook) > Built from scratch (including its own trace API and trace standard) > Head-based sampling (token bucket) https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/
  17. Trace without trace (cont) > Transparent tracing of microservice-based applications - Uses a proxy for trace interception - Uses Linux syscalls for trace instrumentation https://dl.acm.org/doi/abs/10.1145/3297280.3297403
  18. Unifying the standard > Universal context propagation for distributed system instrumentation - Proposes a universal “standard” for context data to overcome the standard war https://dl.acm.org/doi/abs/10.1145/3190508.3190526
  19. Better sampling method > Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering - Tail-based sampling method - Uses machine learning to overcome the feature-selection problem https://dl.acm.org/doi/abs/10.1145/3357223.3362736
  20. Better sampling method > Honeycomb Refinery (DistTrace vendor) - Feature-based, tail-based sampling - https://docs.honeycomb.io/working-with-your-data/tracing/refinery/
  21. Better sampling method (cont) > Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering - Tail-based sampling method - Uses machine learning to overcome the feature-selection problem https://dl.acm.org/doi/abs/10.1145/3357223.3362736
  22. What we could do > Some insights from our company > Logs are the most informative of the three pillars > Tail-based sampling seems to be the cure for the “usefulness” problem of a DistTrace service
  23. OSS moves > Firehose mode in Zipkin and Jaeger (100% sampling, skip indexing) > https://cwiki.apache.org/confluence/display/ZIPKIN/Firehose+mode > https://github.com/jaegertracing/jaeger/issues/1731
  24. Proposed architecture > Trace client runs in firehose mode > Log client injects the trace ID into logs > Feature-based sampler constructs traces by buffering spans in memory, then samples in error traces and high-latency traces > Sampled-in traces go to storage (see the sketch below)
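
Slide 24 only shows the architecture at the box-and-arrow level. The sketch below is one hypothetical way the feature-based sampler could work: buffer spans in memory per trace ID, then sample in traces that contain an error or exceed a latency threshold. All class names, thresholds, and the span model here are illustrative assumptions, not LINE's implementation.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical feature-based tail sampler: decides after the whole trace is buffered. */
public class TailSamplerSketch {

  /** Illustrative span record; a real collector would use the Zipkin span model. */
  record Span(String traceId, String name, Duration duration, boolean error) {}

  private static final Duration LATENCY_THRESHOLD = Duration.ofSeconds(1); // illustrative

  private final Map<String, List<Span>> buffer = new ConcurrentHashMap<>();

  /** Called for every span arriving in firehose mode (100% reporting, no head sampling). */
  void onSpan(Span span) {
    buffer.computeIfAbsent(span.traceId(), id -> new CopyOnWriteArrayList<>()).add(span);
  }

  /** Called once the trace is considered complete (e.g. after a quiet period). */
  boolean sampleIn(String traceId) {
    List<Span> spans = buffer.remove(traceId);
    if (spans == null) return false;
    boolean hasError = spans.stream().anyMatch(Span::error);
    boolean isSlow = spans.stream()
        .anyMatch(s -> s.duration().compareTo(LATENCY_THRESHOLD) > 0);
    // Sample in error traces and high-latency traces; drop everything else.
    return hasError || isSlow;
  }

  public static void main(String[] args) {
    TailSamplerSketch sampler = new TailSamplerSketch();
    sampler.onSpan(new Span("trace-1", "GET /api", Duration.ofMillis(30), false));
    sampler.onSpan(new Span("trace-1", "db query", Duration.ofSeconds(3), false)); // slow span
    System.out.println(sampler.sampleIn("trace-1")); // true: high-latency trace is kept
  }
}
```

A production version would also need trace-completion detection, memory bounds, and the write path to storage, which this sketch leaves out.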
  25. No more search UI for traces > Users reach a trace by trace ID only (no more time-range-based search) > Users get the trace ID from the log search UI (you need searchable logs, e.g. Kibana)
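
Slide 25 relies on the trace ID being present in every log line. The deck does not show LINE's log client, so the following is only a generic sketch of putting a Brave trace ID into the SLF4J logging context; the logger name, MDC key, and log message are placeholders.

```java
import brave.Span;
import brave.Tracer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

class TraceAwareHandler {
  private static final Logger log = LoggerFactory.getLogger(TraceAwareHandler.class);

  private final Tracer tracer;

  TraceAwareHandler(Tracer tracer) {
    this.tracer = tracer;
  }

  void handleRequest() {
    Span span = tracer.nextSpan().name("handle-request").start();
    // Put the trace ID into the logging context so every log line carries it;
    // users can then copy it from Kibana and open the trace directly by ID.
    MDC.put("traceId", span.context().traceIdString());
    try {
      log.info("processing request"); // log pattern should include %X{traceId}
      // ... application work ...
    } finally {
      MDC.remove("traceId");
      span.finish();
    }
  }
}
```

Brave also ships logging-context integrations that populate the MDC automatically, which avoids doing this by hand on every request.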
  26. Thank you for listening
