Distributed Tracing at UBER Scale
Creating a treasure map
for your monitoring data
Yuri Shkuro, UBER Technologies
ABOUT ME
• Software Engineer on the
Observability team in NYC
• Working on the open source
distributed tracing system Jaeger
• Co-founded the OpenTracing
project
• Banking industry survivor
• Github: yurishkuro
• Twitter: @yurishkuro
Would You Like Some Tracing with
Your Monitoring?
What does it take to roll it out?
Why Distributed Tracing
• Distributed transaction monitoring
• Performance / latency optimization
• Root cause analysis
• Service dependency analysis
• Distributed context propagation (“baggage”)
JAEGER, Distributed Tracing
• Open Source
• OpenTracing inside
• In active development
• PRs are welcome
• Zipkin compatible
• github.com/uber/jaeger
Who Thinks Tracing is Awesome?
Why Doesn’t Everyone Do Tracing?
Tracing Instrumentation is
HARD
EXPENSIVE
BORING
Instrumentation
• Metrics and logging are not new
• Tracing is both new and harder
Context Propagation
A
B
C
D
E
{context}
{context}
{context}
{context}
UniqueID → {context}
Edgeservice
Headers:
. . .
Trace ID
. . . Instrumentation
APPLICATION / MICROSERVICE
Handler
Context
[Span]
Client
Context
[Span]
Inbound
HTTP
Request
Instrumentation
Headers:
. . .
Trace ID
. . .
Outbound
HTTP
Request
Context Propagation
In-Process Context Propagation
Implicit, via Thread-Locals
but: thread pools, futures
Explicit
It’s Also the Frameworks
• Go: stdlib, gorilla, …
• Java: jaxrs2, okhttp, ApacheHttpClient, …
• Python: Flask, Django, Tornado, urllib2, …
• Node.js – who knows…
OpenTracing to the Rescue
No Help With In-Process Propagation
• Must be done manually
• UBER has 2000-3000 microservices
• Resources of the tracing team are limited
• Developers must instrument their code!
BITE MAKE ME!
How do we mobilize the org?
Traveling Salesman Problem
2017 edition
They Must Want Your Product
or Sticks and Carrots
Recap: Why Distributed Tracing
• Distributed transaction monitoring
• Performance / latency optimization
• Root cause analysis
• Service dependency analysis
• Distributed context propagation (“baggage”)
Service Dependency Analysis
• Explain to us what we just built
• Who are my dependencies
• Workflow analysis
• Where is all this traffic coming from?
• Service tiers
Baggage
• Tenancy, test or production
– Set at the top
– Used at the storage layer, prod or test DB
• Authentication tokens
– Signed user or service identity
– Checked at multiple levels
Sticks and Carrots
• Get other teams build features on top
– Performance team
– Capacity & cost accounting
– Baggage
• More carrots
• Eventually they become sticks (peer pressure)
Each Organization is Different
Find what works best
How to Measure Adoption?
Measure everything
Does Service X Report Traces?
• Daily aggregation job
• Auto-book tickets
• Build a dashboard
• Pass/Fail: too easy to pass
Trace Quality Score
• Inspect traces
– See a caller, but no spans
• Join with other data
– Routing logs
• Auto-book tickets (carefully, not for everyone)
– With detailed report
Trace Quality Metrics by Service
Thank You
• Jaeger
– https://github.com/uber/jaeger
– Blog: Evolving Distributed Tracing at UBER
– Blog: Take OpenTracing for a HotROD Ride
• OpenTracing: http://opentracing.io/
• We are hiring
• @yurishkuro

Distributed Tracing at UBER Scale: Creating a treasure map for your monitoring data

  • 1.
    Distributed Tracing atUBER Scale Creating a treasure map for your monitoring data Yuri Shkuro, UBER Technologies
  • 2.
    ABOUT ME • SoftwareEngineer on the Observability team in NYC • Working on the open source distributed tracing system Jaeger • Co-founded the OpenTracing project • Banking industry survivor • Github: yurishkuro • Twitter: @yurishkuro
  • 3.
    Would You LikeSome Tracing with Your Monitoring? What does it take to roll it out?
  • 4.
    Why Distributed Tracing •Distributed transaction monitoring • Performance / latency optimization • Root cause analysis • Service dependency analysis • Distributed context propagation (“baggage”)
  • 5.
    JAEGER, Distributed Tracing •Open Source • OpenTracing inside • In active development • PRs are welcome • Zipkin compatible • github.com/uber/jaeger
  • 6.
    Who Thinks Tracingis Awesome?
  • 9.
  • 10.
  • 11.
    Instrumentation • Metrics andlogging are not new • Tracing is both new and harder
  • 12.
  • 13.
    Headers: . . . TraceID . . . Instrumentation APPLICATION / MICROSERVICE Handler Context [Span] Client Context [Span] Inbound HTTP Request Instrumentation Headers: . . . Trace ID . . . Outbound HTTP Request Context Propagation
  • 14.
    In-Process Context Propagation Implicit,via Thread-Locals but: thread pools, futures Explicit
  • 15.
    It’s Also theFrameworks • Go: stdlib, gorilla, … • Java: jaxrs2, okhttp, ApacheHttpClient, … • Python: Flask, Django, Tornado, urllib2, … • Node.js – who knows…
  • 16.
  • 17.
    No Help WithIn-Process Propagation • Must be done manually • UBER has 2000-3000 microservices • Resources of the tracing team are limited • Developers must instrument their code!
  • 18.
    BITE MAKE ME! Howdo we mobilize the org?
  • 19.
  • 20.
    They Must WantYour Product or Sticks and Carrots
  • 21.
    Recap: Why DistributedTracing • Distributed transaction monitoring • Performance / latency optimization • Root cause analysis • Service dependency analysis • Distributed context propagation (“baggage”)
  • 22.
    Service Dependency Analysis •Explain to us what we just built • Who are my dependencies • Workflow analysis • Where is all this traffic coming from? • Service tiers
  • 23.
    Baggage • Tenancy, testor production – Set at the top – Used at the storage layer, prod or test DB • Authentication tokens – Signed user or service identity – Checked at multiple levels
  • 24.
    Sticks and Carrots •Get other teams build features on top – Performance team – Capacity & cost accounting – Baggage • More carrots • Eventually they become sticks (peer pressure)
  • 25.
    Each Organization isDifferent Find what works best
  • 26.
    How to MeasureAdoption? Measure everything
  • 27.
    Does Service XReport Traces? • Daily aggregation job • Auto-book tickets • Build a dashboard • Pass/Fail: too easy to pass
  • 28.
    Trace Quality Score •Inspect traces – See a caller, but no spans • Join with other data – Routing logs • Auto-book tickets (carefully, not for everyone) – With detailed report
  • 29.
  • 30.
    Thank You • Jaeger –https://github.com/uber/jaeger – Blog: Evolving Distributed Tracing at UBER – Blog: Take OpenTracing for a HotROD Ride • OpenTracing: http://opentracing.io/ • We are hiring • @yurishkuro

Editor's Notes

  • #2 Thank you all for staying till the very last time segment of the conference. I am sure everyone is tired, so I promise there will be no math or cat pictures in this talk.
  • #3 I am an engineer on the Observability team in NYC, here is a picture of me. It’s actually most of the NY office, both engineers and operations, on the High Line next to Hudson Yards. The Uber office is just two streets to the right. And in the direction we’re all facing, there are no buildings, so I leave it to your imagination how the picture was taken. I work on the distributed tracing system Jaeger that we recently open sourced. I was one of the co-founders of the OpenTracing project, which I will talk about later. And before joining Uber I spent way more time than any engineer should working in finance technology.
  • #4 Many speakers in the last three days mentioned how valuable tracing can be.
  • #9 The poll is biased. Show of hands: - How many people are using some metrics system in their organization? - And how many people have a fully functional distributed tracing solution? - That works across multiple programming languages
  • #11 These adjectives actually refer to the same thing – engineers must spend time adding this instrumentation. You could say, well it’s the same thing with other monitoring, like metrics and logging. But
  • #12 Engineers are used to metrics and logging. Tracing instrumentation is new, and harder to get right (or more boring) Why is that?
  • #13 Tracing instrumentation requires context propagation. It means the architecture must be capable of passing around certain metadata about the distributed transaction. What does it look like in a single node?
  • #14 Here we have an application or microservice, that has a handler for inbound HTTP requests, and it makes some outbound calls to other services. The instrumentation in purpose is responsible for extracting the distributed context from the request, and converting it to some in-memory representation. Then the service itself must make sure that the context object is available to instrumentation on the right to be encoded into the outbound request. In case of HTTP, the context is typically passed as one or more HTTP headers. Obvious: start with infrastructure libraries The difficult part here is actually the dotted line in the middle, passing the context inside the application – in-process propagation.
  • #15 Some languages like Java and Python support thread-local storage. We can use that to pass the context implicitly. Happy story for instrumentation, no changes to the application’s code. Other languages do not have thread locals. For example, Go has no way of even identifying a go-routine. Node.js has CLS, but it’s not performant enough. So we need to explicitly pass the context, which means changes to the application code. Google / twitter experience - localized
  • #16 When I said earlier “start with the infra libraries” – that’s also not so simple in a polyglot environment. You are starting to see the problem? This is why we stood behind OpenTracing from the early days
  • #17 OpenTracing is an open, vendor neutral standard. Frameworks can instrument themselves! Or at minimum, it can be done via open source plugins that everybody can reuse.
  • #18 But OpenTracing does not help with the need for explicit in-process context propagation in languages like Go and Node.js It has to be done manually, there is no other way. Now consider that Uber has somewhere between 2 and 3.000 microservices, and only 5 people on the tracing team. It doesn’t scale, we cannot go and change the source code of half of those services. Inevitable conclusion – application owners must do that themselves.
  • #19 But they are busy shipping features. How do we incentivize them?
  • #20 I call it a traveling salesman problem, 2017 edition. Find the optimal way to spend your tracing team’s time to achieve maximum level of adoption. It’s not a pure technology problem, there’s a salesmanship involved.
  • #21 How do we make developers use ANY given tool? You need to give them something they want. A sticks and carrots approach can also work, because engineering organizations are not democracies.
  • #22 Remember we talked about these features? It turns out the last two