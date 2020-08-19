
Why Distributed Tracing is Essential for Performance and Reliability Daniel "Spoons" Spoonhower Julie Lawson

Many engineering organizations have now adopted microservices or other loosely coupled architectures, often alongside DevOps practices. Together these have enabled individual service teams to become more independent and, as a result, have boosted developer velocity. However, this increased velocity often comes at the cost of overall application performance or reliability. Worse, teams often don't understand what's affecting performance or reliability – or even who to ask to learn more. Distributed tracing was developed at organizations like Google and Twitter to address these problems and has also come a long way in the decade since then. By the end of this presentation, you'll understand why distributed tracing is necessary and how it can bring performance and reliability back under control.

Published in: Technology
Why Distributed Tracing is Essential for Performance and Reliability

  2. Lightstep enables teams to detect and resolve regressions quickly, regardless of system scale or complexity. Lightstep integrates seamlessly into daily workflows, whether you are proactively optimizing performance or investigating a root cause, so you can quickly get back to building features.
  3. Click on the Questions panel to interact with the presenters.
  4. About Daniel “Spoons” Spoonhower: Daniel “Spoons” Spoonhower is a co-founder at Lightstep, where he’s building performance management tools for deep software systems. He is an author of Distributed Tracing in Practice (O’Reilly Media, 2020). Previously, Spoons spent almost six years at Google, where he worked as part of Google’s infrastructure and Cloud Platform teams. He has published papers on the performance of parallel programs, garbage collection, and real-time programming. He has a PhD in programming languages from Carnegie Mellon University but still hasn’t found one he loves. About Julie Lawson: Julie Lawson majored in English at Boston University. She worked at the college radio station, WTBU, where she developed a passion for producing shows with good music and good stories. She started her career at a small publishing house in Los Angeles and went on to become Webinar Coordinator at Aggregage, where she produces webinars and facilitates BTS webinar functions. In her spare time, she enjoys going to the beach, camping, and reading great books.
  5. Daniel “Spoons” Spoonhower, CTO and Co-founder. Why Distributed Tracing is Essential for Performance and Reliability or, How to Get Actual Business Value From Distributed Tracing!
  6. Who am I? Spoons (aka Daniel Spoonhower), CTO and Co-founder. @save_spoons, spoons@lightstep.com
  7. What Changed?
  8. More autonomy… but less visibility!
  9. Observe kustomize ? ?? Control Team-by-team ok Must be org-wide! Distributed tracing!
  10. Distributed tracing: essential for modern apps & DevOps. Developer velocity; Software performance; Managing costs; Fundamentals; Deploying tracing.
  11. Distributed Tracing
  12. Distributed tracing, defined. Traces are a form of telemetry based on spans (span = timed event describing work done by a single service). Tracing is a diagnostic tool that reveals how a set of services coordinate to handle individual user requests, from mobile or browser to backends to databases (end-to-end), including metadata like events (logs) and annotations (tags). Tracing provides a request-centric view of application performance.
  13. Relationships matter. Traces encode causal relationships between callers and callees (calls, returns).
  14. Traces are the raw material, not the finished product. Distributed traces: basically just structs. Distributed tracing: the art and science of deriving value from traces.
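To make the "basically just structs" point concrete, here is a minimal sketch of what a span record and its caller/callee relationships might look like. The field names, service names, and timings are illustrative assumptions, not the data model of any particular tracing system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One timed unit of work done by a single service (hypothetical shape)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    operation: str
    start_ms: float
    end_ms: float
    tags: Dict[str, str] = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def children_by_parent(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Index spans by parent id; this index is the caller -> callee relation."""
    index: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        index.setdefault(span.parent_id, []).append(span)
    return index


def render(spans: List[Span], parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Return the trace as indented lines, callers above their callees."""
    lines: List[str] = []
    callees = children_by_parent(spans).get(parent_id, [])
    for span in sorted(callees, key=lambda s: s.start_ms):
        indent = "  " * depth
        lines.append(f"{indent}{span.service}:{span.operation} ({span.duration_ms:.0f} ms)")
        lines.extend(render(spans, span.span_id, depth + 1))
    return lines


# Made-up example trace: one request entering at a gateway.
TRACE = [
    Span("1", None, "gateway", "GET /checkout", 0, 120),
    Span("2", "1", "cart", "load_cart", 5, 45),
    Span("3", "1", "payments", "authorize", 50, 115),
    Span("4", "3", "database", "insert_charge", 60, 100),
]

print("\n".join(render(TRACE)))
```

The records alone are inert; the value comes from what is computed over them, such as the caller/callee tree rendered here.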
  15. Developer Velocity
  16. Increasing developer velocity: make (common) tasks faster; reduce interruptions; improve communication; prioritize high impact work. Verify deployments; root cause analysis; better alerts; understand dependencies; define and track SLOs.
  17. Accelerate root cause analysis
  18. More actionable alerts. “Are We All on the Same Page? Let’s Fix That,” Luis Mineiro. Check out 99percentdevops.com
  19. Understanding dependencies… without tracing. [Diagram: services A, B, C, D, E with per-connection metrics: 8% error rate; avg. response size up 31%; request rate up 4%]
  20. Understanding dependencies. Without tracing: each connection in isolation; “A talks to B”; no way to narrow scope; no way to meaningfully tie in other metrics. With tracing: end-to-end context; request graph; can refine based on any property of the request; metrics linked to current scope.
  21. Use traces and service dependencies to: enhance training for new team members; facilitate operational review meetings; inform architectural design decisions; set SLOs for internal services. Use SLOs to: measure reliability; set error budgets; hold teams accountable.
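As a rough illustration of the "request graph" idea from the dependency slides, the sketch below derives service-to-service edges from span parent links and then narrows the graph to requests with a given property. The span shape, the `platform` tag, and the sample data are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Dict, List, NamedTuple, Optional


class Span(NamedTuple):
    """Hypothetical minimal span: just enough to recover who called whom."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    tags: Dict[str, str]


def dependency_edges(
    spans: List[Span],
    keep_trace: Optional[Callable[[List[Span]], bool]] = None,
) -> Counter:
    """Count (caller service, callee service) edges, optionally per-trace filtered."""
    by_trace: Dict[str, List[Span]] = {}
    for span in spans:
        by_trace.setdefault(span.trace_id, []).append(span)

    edges: Counter = Counter()
    for trace_spans in by_trace.values():
        if keep_trace is not None and not keep_trace(trace_spans):
            continue  # this request is outside the scope being examined
        by_id = {s.span_id: s for s in trace_spans}
        for s in trace_spans:
            parent = by_id.get(s.parent_id)
            if parent is not None and parent.service != s.service:
                edges[(parent.service, s.service)] += 1
    return edges


# Made-up spans from two requests.
SPANS = [
    Span("t1", "a1", None, "A", {"platform": "ios"}),
    Span("t1", "b1", "a1", "B", {}),
    Span("t1", "d1", "b1", "D", {}),
    Span("t2", "a2", None, "A", {"platform": "web"}),
    Span("t2", "c2", "a2", "C", {}),
    Span("t2", "e2", "c2", "E", {}),
    Span("t2", "b2", "a2", "B", {}),
]


def is_ios(trace_spans: List[Span]) -> bool:
    return any(s.tags.get("platform") == "ios" for s in trace_spans)


all_edges = dependency_edges(SPANS)
ios_edges = dependency_edges(SPANS, keep_trace=is_ios)
```

Without traces, only the unfiltered counts would be available; the per-request filter is what allows narrowing scope by any property of the request.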
  22. Software Performance
  23. Improving software performance. Performance means “performance as experienced by end users.” Tracing can help through: better distribution of computation; focusing optimization where it matters.
  24. Defining the critical path. A (part of a) span is on the critical path if reducing its duration speeds up the overall request. [Diagram: the request is waiting for blue… therefore, blue is on the critical path here]
  25. Rebalancing fan-out
  26. Amdahl’s Law. Given a choice between speeding up A and B: 1. A 50% improvement in B is better than a 50% improvement in A. 2. No improvement in A will ever improve overall performance by >15%. Obvious… once you have the data :)
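The Amdahl's Law slide can be checked with a few lines of arithmetic. The sketch below assumes, purely for illustration, that A accounts for 15% of request latency and B for the remaining 85%, and that a "50% improvement" means halving a component's duration; the slides do not state these splits explicitly.

```python
import math


def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's Law: overall speedup when `fraction` of the time gets `local_speedup`x faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def latency_reduction(fraction: float, local_speedup: float) -> float:
    """Share of total request latency removed by the same change."""
    return fraction * (1.0 - 1.0 / local_speedup)


# Assumed split of request latency between the two components.
FRACTION_A = 0.15
FRACTION_B = 0.85

halve_a = latency_reduction(FRACTION_A, 2.0)       # A twice as fast
halve_b = latency_reduction(FRACTION_B, 2.0)       # B twice as fast
best_case_a = latency_reduction(FRACTION_A, math.inf)  # A takes no time at all

print(f"halving A removes {halve_a:.1%} of latency")
print(f"halving B removes {halve_b:.1%} of latency")
print(f"eliminating A removes at most {best_case_a:.1%} of latency")
```

Under these assumptions, even eliminating A entirely cannot remove more than A's own share of the request, which is the point of the second item on the slide.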
  27. Managing Costs
  28. Types of costs. Operational costs: developer time (failed deployments, oncall, meeting overhead). Revenue and reputational costs: missed SLOs, failed conversions, unhappy users. Infrastructure costs: compute, network, storage, API usage. Monitoring costs: take aggregated logs as an example.
  29. Calculating logging costs. Initial factors: aggregating and indexing logs per service (storage, compute, network); peak instance count; retention period; services involved in a request. Initial values: assuming 50GB of log data a day, 14-day retention, high availability (no cold storage): 1 Primary (L Compute Optimized), $89; 2 Data (XL Memory Optimized), $426; 3 SSDs (General Purpose), $201.
  30. $716: cloud spend at 50GB of logs a day (monthly)
  31. $3,386: total after setup and maintenance (monthly)
  32. Reducing logging spend with tracing. Annotate spans with logs! It’s as easy as: span.addEvent("illegal base64 data at input byte 7"). Leverage traces to determine which logs to store. Monthly logging spend: $3,571 vs. $712. Logging data is more valuable in context!
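One way to picture "annotate spans with logs" and "leverage traces to determine which logs to store" is the sketch below. It uses a small stand-in `Span` class written for this example rather than a real tracing SDK; the retention rule (keep events only from failed or slow spans) and the 500 ms threshold are assumptions, not something the slides specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Event:
    """A log line attached to a span, so it carries the request's context."""
    name: str
    attributes: Dict[str, str]


@dataclass
class Span:
    """Stand-in span type for illustration only (not a real SDK class)."""
    operation: str
    duration_ms: float
    error: bool = False
    events: List[Event] = field(default_factory=list)

    def add_event(self, name: str, attributes: Optional[Dict[str, str]] = None) -> None:
        self.events.append(Event(name, dict(attributes or {})))


def events_worth_storing(spans: List[Span], slow_ms: float = 500.0) -> List[Event]:
    """Keep events only from spans that failed or were slow (assumed policy)."""
    kept: List[Event] = []
    for span in spans:
        if span.error or span.duration_ms >= slow_ms:
            kept.extend(span.events)
    return kept


# Made-up spans: one healthy, one failed, one slow.
ok = Span("decode_upload", duration_ms=12.0)
ok.add_event("decoded 4 KiB payload")

failed = Span("decode_upload", duration_ms=9.0, error=True)
failed.add_event("illegal base64 data at input byte 7", {"input.size": "4096"})

slow = Span("resize_image", duration_ms=830.0)
slow.add_event("cache miss, fetching original")

stored = events_worth_storing([ok, failed, slow])
```

Dropping the events from healthy, fast requests is where a reduction in stored log volume would come from in this sketch; the right policy depends on the system.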
  33. Deploying Tracing
  34. On your tracing migration. Tracing is not an all-or-nothing endeavour: how to deliver incremental value for the org; how to use that value to inform next steps of the journey. Value to developers should be your (meta-)metric of success.
  35. Step 1: Start w/ customer-critical experiences. Look at the edge and build an MVP: as close as you can (reasonably) get to users; often an API gateway or proxy. Map incoming operations → dependencies: identify next steps; build a case for others to adopt tracing.
  36. Step 2: Playbook for service owners. Establish conventions for tags, etc.: What matters to your business? What would explain failures? Instrument frameworks, libraries, shared services: accelerate adoption by reusing code; enforce conventions programmatically.
  37. Step 3: Integrate with existing workflows. Where do engineers work today? IDEs, testing frameworks, CI/CD; dashboards; notification and alerting; …
  38. Building observable services. Use open standards like OpenTelemetry for instrumenting service code. OpenTelemetry provides a single set of APIs, SDKs, and tools for generating distributed traces and metrics from your services.
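The Step 2 advice to "enforce conventions programmatically" could be sketched as a small check that shared instrumentation runs before a span is started. The required tag names and the `start_span` helper below are hypothetical; a real deployment would choose its own conventions and wire the check into whatever tracing library it uses.

```python
from typing import Dict, List

# Hypothetical org-wide tag conventions: what matters to the business
# and what would help explain failures.
REQUIRED_TAGS = frozenset({"service.version", "customer.tier", "region"})


def missing_tags(tags: Dict[str, str]) -> List[str]:
    """Return the required tag names absent from `tags`, sorted for stable output."""
    return sorted(REQUIRED_TAGS - tags.keys())


def start_span(operation: str, tags: Dict[str, str]) -> Dict[str, object]:
    """Hypothetical shared helper: refuse to start a span that breaks conventions."""
    missing = missing_tags(tags)
    if missing:
        raise ValueError(f"span {operation!r} is missing required tags: {missing}")
    return {"operation": operation, "tags": dict(tags)}


good = start_span(
    "GET /checkout",
    {"service.version": "1.4.2", "customer.tier": "enterprise", "region": "us-east"},
)

try:
    start_span("GET /checkout", {"region": "us-east"})
    rejected = False
except ValueError:
    rejected = True
```

Placing a check like this in a shared framework or library means service owners get the conventions without re-implementing them, which is the reuse argument made in Step 2.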
  39. In summary, distributed tracing provides: faster RCA; better alerts; up-to-date dependency maps; improved compute fan-out; targeted optimization; integrated telemetry. Improved developer velocity; faster software performance; better cost management. Distributed tracing puts application behavior in context to help answer the primary question of observability: “What caused that change?”
  40. Get Started Today: go.lightstep.com/request-a-demo
  41. Q&A. With: Daniel “Spoons” Spoonhower, CTO and Co-founder, Lightstep (LinkedIn: /in/spoons; Website: lightstep.com; Twitter: @save_spoons). With: Julie Lawson, Webinar Coordinator, Aggregage (LinkedIn: /in/Julie-Lawson; Email: julie@aggregage.com; Website: saasbrief.com).
  42. Daniel “Spoons” Spoonhower, CTO and Co-founder. Thank you! @save_spoons

