Distributed tracing
with Jaeger and OpenTracing
Which requests are slow? Why?
• Lookup in Kibana: finds the request but lacks context, so we still don't know why it was slow
• Grafana/Graphite: we know which operations were slow but can't correlate them to any particular request
• Unclear how to aggregate data
• Lack of transparency and extensibility
Problem definition
• Understanding individual request behaviour
  - Look at a specific request, not just an aggregated view
  - Search by request ID
• Identification of performance bottlenecks
  - Given the data for a particular request, identify which operation was causing performance issues
• Structured data access
  - View the N slowest requests
  - Search for requests on a particular shard/host or from a particular client
Distributed tracing to the rescue!
• Pass a unique ID (Trace ID) from the service entry point to the end of execution
• Build a DAG of related operations
• Contextualise metadata
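A minimal sketch of the idea above, using assumed toy types rather than the real OpenTracing API: every operation in a request shares one Trace ID, and each span's parent link connects the operations into a DAG.

```go
package main

import "fmt"

// Span is a simplified stand-in for a tracing span: all spans in one
// request share a TraceID, and ParentID links each span to the
// operation that started it, forming a DAG of related operations.
type Span struct {
	TraceID   string
	SpanID    string
	ParentID  string // empty for the root span
	Operation string
}

func main() {
	root := Span{TraceID: "abc123", SpanID: "1", Operation: "GET /feed"}
	db := Span{TraceID: root.TraceID, SpanID: "2", ParentID: root.SpanID, Operation: "db.query"}
	cache := Span{TraceID: root.TraceID, SpanID: "3", ParentID: root.SpanID, Operation: "cache.get"}

	for _, s := range []Span{root, db, cache} {
		fmt.Printf("%s trace=%s span=%s parent=%q\n", s.Operation, s.TraceID, s.SpanID, s.ParentID)
	}
}
```

Searching by Trace ID then returns the whole tree of operations for one request, which is exactly the per-request view Kibana and Graphite could not give us.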
OpenTracing
• Vendor-neutral, open API standard for distributed tracing
• Decouples the tracing backend from the client
• Library/framework integrations
  - Integration via middleware enables cross-service tracing over a variety of transports
Metadata
• Tags
• Baggage
• Events
• Logs
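The key distinction among these: tags annotate a single span, while baggage propagates to every descendant span (and across process boundaries). A toy illustration with an assumed `Span` type, not the real OpenTracing API:

```go
package main

import "fmt"

// Span is a toy model: Tags stay local to one span, while Baggage is
// copied into every child span it starts.
type Span struct {
	Operation string
	Tags      map[string]string
	Baggage   map[string]string
}

// child starts a sub-span: baggage is inherited, tags are not.
func (s *Span) child(op string) *Span {
	bag := make(map[string]string, len(s.Baggage))
	for k, v := range s.Baggage {
		bag[k] = v
	}
	return &Span{Operation: op, Tags: map[string]string{}, Baggage: bag}
}

func main() {
	root := &Span{
		Operation: "GET /feed",
		Tags:      map[string]string{"http.status_code": "200"},
		Baggage:   map[string]string{"api_key": "k42"},
	}
	sub := root.child("db.query")
	fmt.Println(sub.Baggage["api_key"])       // inherited from the parent
	fmt.Println(sub.Tags["http.status_code"]) // empty: tags do not propagate
}
```

Because baggage rides along with every downstream call, it should be kept small; tags are the right place for per-operation detail.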
How are we using OpenTracing
Integration with StreamGoKit:
Timer metrics can create spans via the *WithTracing method variants
Integration with API and Proxy servers:
An incoming request starts a span and extracts relevant metadata (shard, API key, etc.)
Tracing span information is passed through the code via a context.Context object
Cross-service trace linking (the Trace ID is passed in HTTP headers/gRPC metadata)
func myOperation(ctx context.Context /* ... */) {
   // Start a timer that also opens a tracing span; subCtx carries
   // the new span so that sub-operations become its children.
   timer, subCtx := metrics.StartWithTracing(ctx, "<operation>")
   defer timer.Stop()

   // timed operation

   subTimer, _ := metrics.StartWithTracing(subCtx, "<sub-operation>")
   // timed sub-operation
   subTimer.Stop()

   // timed operation continues
}
Jaeger
• Distributed tracing system released as open source
by Uber Technologies
- v1.0 released Dec 2017
• Written in Go
Flexible pipeline
• Agent
  - abstracts routing and discovery of collectors away from the client
  - allows adaptive sampling rates to be implemented
• Collector
• UI for querying
Jaeger issues
• Authentication for agent → collector communication
• Authentication for collector → storage → UI
• UI is barely usable
Our deployment now (WIP)
• Collectors co-located with agents on service machines
• Proxy, API, DB servers partially instrumented
• Elasticsearch as storage
Distributed tracing with OpenTracing and Jaeger @ getstream.io
