Swift Distributed Tracing Method and Tools
by Zhang Hua (Edward)
Standards Team/ETI/CDL/IBM

Swift distributed tracing method and tools v2


A proposal of Swift session for OpenStack Atlanta design summit.
http://junodesignsummit.sched.org/event/0f185cd5bcc2c9b58c639bba25bc0025#.U3SZRa1dXd4
http://summit.openstack.org/cfp/details/354

Published in: Technology


  1. Swift Distributed Tracing Method and Tools
     by Zhang Hua (Edward), Standards Team/ETI/CDL/IBM
  2. Agenda
      Background
      Tracing Proposal
      Tracing Architecture
      Tracing Data Model
      Tracing Analysis Tools
      Reference
  3. Background
     • Swift is a large-scale distributed object store spanning thousands of nodes across multiple zones and regions.
       – End-to-end performance is critical to the success of Swift.
       – Tools that help in understanding system behavior and reasoning about performance issues are invaluable.
     • Motivation
       – For a particular client request X, what is the actual route as it is served by different services? Does the actual route differ from the expected route, even when we know the access patterns?
       – What is the performance behavior of the server components and third-party services? Which part is slower than expected?
       – How can we quickly diagnose the problem when a request breaks at some point?
     [Diagram: e.g. PUT request X. Client (1) sends X to Proxy-Server (1); X' goes to Container-Server (1), (2), and (3); X1'', X2'', and X3'' go to Account-Server (1), (2), and (3).]
  4. Which part is slow? Looking at your logs?
     When a request is made to Swift, it is given a unique transaction id. This id should appear in every log line related to that request, which is useful when looking at all the services hit by a single request. But is it efficient or handy to do?
  5. Correlate the logs
     • Proxy server log @ node-P
     • Container server log @ node-C
     • Account server log @ node-A
     • Object server log @ node-O
     Correlate the pieces of information by transaction id and client IP across the logs of all related hashed nodes!
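The manual correlation described on this slide can be sketched in a few lines of Python. This is only an illustration: the log lines, the node names as dictionary keys, and the loose `tx...` pattern for transaction ids are assumptions made for the example, not taken from the slides.

```python
import re
from collections import defaultdict

# Assumed pattern: a transaction id starts with "tx" followed by hex digits,
# optionally followed by "-" and more hex digits. Adjust to your log format.
TXN_ID = re.compile(r"\btx[0-9a-f]{8,}(?:-[0-9a-f]+)?\b")


def correlate(logs):
    """Group (node, line) pairs by the transaction id found in each line."""
    by_txn = defaultdict(list)
    for node, lines in logs.items():
        for line in lines:
            match = TXN_ID.search(line)
            if match:
                by_txn[match.group(0)].append((node, line))
    return dict(by_txn)


# Hypothetical log excerpts, keyed by the node they were collected from.
sample_logs = {
    "node-P": [
        "proxy-server: PUT /v1/AUTH_test/c 201 txabc123def4567890-0053747a38",
    ],
    "node-C": [
        "container-server: PUT /sdb1/347/AUTH_test/c 201 txabc123def4567890-0053747a38",
        "container-server: GET /sdb1/12/AUTH_test/d 200 tx0000000000000001-0053747a39",
    ],
}

grouped = correlate(sample_logs)
for txn, entries in grouped.items():
    print(txn, [node for node, _ in entries])
```

Even in this small form, the approach needs every log file from every node to be gathered in one place first, which is the cost the slide is pointing at.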
  6. StatsD Metrics
     • Counters + Counter_rate (sampling)
       – Proxy-Server.{ACO}.{METHOD}.{CODE}
       – {ACO}-server.{METHOD}.{CODE}
     • Timers + Timer_data
       – {ACO}-{DAEMON}.timing
       – {ACO}-{DAEMON}.error.timing
       – {ACO}-server.{METHOD}.timing
     StatsD logging options:
       # access_log_statsd_host = localhost
       # access_log_statsd_port = 8125
       # access_log_statsd_default_sample_rate = 1.0
       # access_log_statsd_sample_rate_factor = 1.0
       # access_log_statsd_metric_prefix =
       # access_log_headers = false
       # log_statsd_valid_http_methods = GET,HEAD,POST,PUT,DELETE,COPY,OPTIONS
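To make the counter and timer metrics concrete, here is a minimal sketch of the plain-text StatsD datagram format (`name:value|c` for counters, `name:value|ms` for timers, with an optional `|@rate` suffix for sampled counters). The helper names and the example metric names are assumptions chosen to resemble the patterns on the slide; this is not Swift's own StatsD client.

```python
import random
import socket


def format_counter(name, value=1, sample_rate=1.0):
    """Build a StatsD counter line, adding the rate suffix when sampling."""
    line = f"{name}:{value}|c"
    if sample_rate < 1.0:
        line += f"|@{sample_rate}"
    return line


def format_timer(name, millis):
    """Build a StatsD timer line; the value is a duration in milliseconds."""
    return f"{name}:{millis}|ms"


def send_metric(line, host="localhost", port=8125, sample_rate=1.0):
    """Send one metric over UDP, dropping it at random when sampling.

    Returns True if a datagram was sent. UDP is fire-and-forget, so there is
    no delivery guarantee and no local disk I/O.
    """
    if sample_rate < 1.0 and random.random() >= sample_rate:
        return False
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
    return True


print(format_counter("proxy-server.object.PUT.201"))
print(format_counter("proxy-server.object.PUT.201", sample_rate=0.5))
print(format_timer("object-server.PUT.timing", 12.5))
```

Note that each line carries only a metric name and a value. Nothing in the datagram identifies the request that produced it, which is why these metrics cannot be tied back to a specific transaction.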
  7. Pros and cons of the current implementation (StatsD logging)
     • Pros
       – Real-time performance metrics to monitor the health of a Swift cluster
       – Low performance impact: metrics are sent over UDP, with no hit on local disk I/O
       – Supported by different backends for reporting and visualization
       – Lightweight
       – Simple to use
       – Rich logging tools
     • Cons
       – Designed for cluster-level health, not for end-to-end performance
       – Cannot provide metrics data for a specific set of requests
       – No relationship between different sets of metrics for specific transactions or requests
       – Not designed for real time
       – Requires more effort to collect and analyze
       – No representation for an individual span
       – Message size limitation
     • Rethink it
       Can we provide a real-time, end-to-end performance tracing/tracking tool in the Swift infrastructure, so that developers and users can analyze performance in development and operation environments?
  8. Our Proposal
     • Goal
       – Targeting researchers, developers, and admins, provide a method of traceability to understand end-to-end performance issues and identify bottlenecks.
     • Scope
        Add WSGI middleware and hooks into Swift components to collect trace data
        The middleware controls the activation and generation of traces
        Generate trace and span ids, collect the data, and tie them together
        Send trace data to an aggregator and save it into a repository
        Minor fixes to the current Swift implementation so that the path includes all hops
        Similar to trans-id, the trace-id and span-id need to be propagated correctly through HTTP headers between services and components
        Analysis tools for reporting and visualization
        Query the trace data by tiered trace ids
        Reconstruct the span tree for each trace
  9. Swift Messaging Route
     Create a new container: PUT /account/container
     • Swift components talk via HTTP request and response messages.
     • It is easy to use HTTP headers as the clue to trace the route.
     [Diagram: Swift Client, Proxy Server, Auth, three Container Servers, and three Account Servers exchanging Request-X PUT / Response-X PUT, Request-X' GET / Response-X' GET, Request-X'' PUT / Response-X'' PUT, and Request-X''' PUT / Response-X''' PUT.]
  10. Span Tree of Trace
      Create a new container: PUT /account/container
      • X-Trace-Id: identifies each trace
         Use X-Trans-Id to support different clusters?
         Or generate a new id for this purpose?
      • X-Span-Id: identifies each span, which represents an individual HTTP RESTful call or WSGI call
         Generate a new span id for this purpose (note: a UUID can be used in the implementation)
      [Diagram: the same components as the messaging route. Request-X PUT carries X-Trace-Id: 1234; Request-X'' PUT carries X-Trace-Id: 1234 and X-Span-Id: 1; Request-X''' PUT carries X-Trace-Id: 1234 and X-Span-Id: 2. Request-X' GET / Response-X' GET and the PUT responses are also shown.]
  11. X-trace Middleware Architecture
      1. Generate trace ids based on configuration.
      2. Create spans and collect trace data.
      3. Propagate trace ids to the next hop.
      4. Send trace data to a repository via a separate transport protocol/channel.
      [Diagram: Swift Client, Proxy Server, Auth, three Container Servers, and three Account Servers, with an x-trace component attached to the servers and feeding a trace data repository.]
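The four steps above can be sketched as a small WSGI middleware. This is a hedged illustration, not the proposed x-trace middleware itself: the class name `TraceMiddleware`, the `sink` callable that stands in for the trace data repository, and the flat span dictionary are all assumptions. Only the `X-Trace-Id` and `X-Span-Id` header names come from the slides.

```python
import time
import uuid


class TraceMiddleware:
    """Wrap a WSGI app and record one span per request (illustrative only)."""

    def __init__(self, app, service_name, sink):
        self.app = app
        self.service_name = service_name
        self.sink = sink  # hypothetical: any callable that accepts a span dict

    def __call__(self, environ, start_response):
        # Step 1: reuse the incoming trace id, or start a new trace.
        trace_id = environ.get("HTTP_X_TRACE_ID") or uuid.uuid4().hex
        # The caller's span id becomes this span's parent.
        parent_id = environ.get("HTTP_X_SPAN_ID")
        span_id = uuid.uuid4().hex

        # Step 3: expose the ids so the wrapped app can copy them to the
        # requests it makes to the next hop.
        environ["HTTP_X_TRACE_ID"] = trace_id
        environ["HTTP_X_SPAN_ID"] = span_id

        # Step 2: create the span and collect data about this call.
        span = {
            "_id": span_id,
            "trace_id": trace_id,
            "parent": parent_id,
            "name": environ.get("REQUEST_METHOD", ""),
            "endpoint": self.service_name,
            "start_time": time.time(),
        }

        def traced_start_response(status, headers, exc_info=None):
            span["return_code"] = status
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, traced_start_response)
        finally:
            # Measured when the app returns its iterable, not when the body
            # has been fully sent to the client.
            span["end_time"] = time.time()
            # Step 4: hand the span to the repository channel.
            self.sink(span)


def demo_app(environ, start_response):
    start_response("201 Created", [("Content-Length", "0")])
    return [b""]


def demo_start_response(status, headers, exc_info=None):
    return None


spans = []
app = TraceMiddleware(demo_app, "container.server", spans.append)
request = {"REQUEST_METHOD": "PUT", "HTTP_X_TRACE_ID": "1234", "HTTP_X_SPAN_ID": "1"}
body = app(request, demo_start_response)
print(spans[0]["trace_id"], spans[0]["parent"], spans[0]["return_code"])
```

In a real deployment the sink would write to a separate channel rather than an in-memory list, so that tracing does not sit on the request path.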
  12. Patches to fix the request path
      • The proxy server passes the trace id along in HTTP headers, but the id is lost at some points because a new request is created for the next hop.
      • Patches are needed to fix this problem and form a complete tracing path through the container server, object server, etc.
      [Diagram: the same components as the middleware architecture, annotated "propagate trace id in next new request".]
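The kind of fix described here amounts to copying the trace headers from the incoming request into each newly created outgoing request. A minimal sketch, assuming headers are available as plain dictionaries; the function name `propagate_trace_headers` is hypothetical and does not correspond to an existing Swift helper.

```python
# Header names taken from the slides.
TRACE_HEADERS = ("X-Trace-Id", "X-Span-Id")


def propagate_trace_headers(incoming, outgoing):
    """Return a copy of `outgoing` with trace headers copied from `incoming`.

    Header lookup is case-insensitive, since HTTP header names are.
    """
    lowered = {key.lower(): value for key, value in incoming.items()}
    result = dict(outgoing)
    for name in TRACE_HEADERS:
        value = lowered.get(name.lower())
        if value is not None:
            result[name] = value
    return result


incoming_headers = {"x-trace-id": "1234", "X-Span-Id": "1", "Content-Length": "0"}
next_hop_headers = propagate_trace_headers(
    incoming_headers, {"X-Timestamp": "1400146616.55486"}
)
print(next_hop_headers)
```

Without a step like this at every place where a server builds a fresh request, the span tree stops at the first hop that drops the headers.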
  13. Tie together tracing data
      Reconstruct the causal and temporal relationship view for a PUT container call. Each call returns 201; the timeline axis runs from 0 ms to 200 ms.
      Swift-Client.PUT           parent-span-id=none, span-id=0
      Proxy-Server.PUT           parent-span-id=0, span-id=1
      Container-Server.PUT       parent-span-id=1, span-id=2
      Container-Server.PUT       parent-span-id=1, span-id=3
      Container-Server.PUT       parent-span-id=1, span-id=4
      Account-Server.PUT         parent-span-id=2, span-id=5
      Account-Server.PUT         parent-span-id=3, span-id=6
      Account-Server.PUT         parent-span-id=4, span-id=7
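Reconstructing the causal view from the parent and span ids on this slide is a matter of grouping spans by parent and walking the result. A minimal sketch, using the span names and ids from the slide; the tuple layout and the function names are assumptions for illustration.

```python
from collections import defaultdict

# (name, parent-span-id, span-id), as listed on the slide.
SPANS = [
    ("Swift-Client.PUT", None, 0),
    ("Proxy-Server.PUT", 0, 1),
    ("Container-Server.PUT", 1, 2),
    ("Container-Server.PUT", 1, 3),
    ("Container-Server.PUT", 1, 4),
    ("Account-Server.PUT", 2, 5),
    ("Account-Server.PUT", 3, 6),
    ("Account-Server.PUT", 4, 7),
]


def build_children(spans):
    """Index span names by id and list each parent's child span ids."""
    names = {}
    children = defaultdict(list)
    for name, parent_id, span_id in spans:
        names[span_id] = name
        children[parent_id].append(span_id)
    return names, children


def render(names, children, parent=None, depth=0):
    """Return the span tree as indented lines, depth-first from the root."""
    lines = []
    for span_id in children.get(parent, []):
        lines.append("  " * depth + f"{names[span_id]} (span-id={span_id})")
        lines.extend(render(names, children, span_id, depth + 1))
    return lines


names, children = build_children(SPANS)
tree_lines = render(names, children)
print("\n".join(tree_lines))
```

The temporal view additionally needs start and end times for each span, which the backend data model records.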
  14. Another example: upload an object
      Each call returns 201; the timeline axis runs from 0 ms to 200 ms.
      Swift-Client.PUT           parent-span-id=none, span-id=0
      Proxy-Server.PUT           parent-span-id=0, span-id=1
      Object-Server.PUT          parent-span-id=1, span-id=2
      Object-Server.PUT          parent-span-id=1, span-id=3
      Object-Server.PUT          parent-span-id=1, span-id=4
      Container-Server.PUT       parent-span-id=2, span-id=5
      Container-Server.PUT       parent-span-id=3, span-id=6
      Container-Server.PUT       parent-span-id=4, span-id=7
  15. Trace into middleware of the pipeline
      • Expand the trace path into the WSGI calls between middleware to get more complete trace data.
      • Possible choices
        – Decorators for __call__
            @trace_here()
            def __call__(self, environ, start_response)
        – Hack the paste deployment package
        – Profile with filters
      pipeline:main
        Pipeline = catch_errors gatekeeper healthcheck proxy-logging cache container_sync bulk slo dlo ratelimit crossdomain tempauth tempurl formpost staticweb container-quotas account-quotas proxy-logging proxy-serve
      [Diagram: Swift Client calling the Proxy Server, whose pipeline (tempauth, cache, tempurl, dlo, slo, …) is traced by x-trace into the trace data repository.]
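The decorator option can be sketched as follows. The slide only shows the name `@trace_here()` and the `__call__` signature; everything else here (the optional `label` argument, the in-memory `TRACE_LOG`, and the `Healthcheck` class used for the demonstration) is an assumption made to keep the example self-contained.

```python
import functools
import time

# Hypothetical stand-in for the trace data repository.
TRACE_LOG = []


def trace_here(label=None):
    """Decorate a middleware's __call__ so each invocation records a span."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, environ, start_response):
            name = label or type(self).__name__
            start = time.time()
            try:
                return func(self, environ, start_response)
            finally:
                # Record the span even if the wrapped call raises.
                TRACE_LOG.append(
                    {"name": name, "start_time": start, "end_time": time.time()}
                )

        return wrapper

    return decorator


class Healthcheck:
    """Toy middleware standing in for one element of the pipeline."""

    @trace_here()
    def __call__(self, environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"OK"]


result = Healthcheck()({"PATH_INFO": "/healthcheck"}, lambda status, headers: None)
print(result, TRACE_LOG[0]["name"])
```

The trade-off with this choice is that every middleware class has to be edited, whereas hooking into paste deployment would wrap the pipeline elements from the outside.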
  16. Backend trace data model
      {
        "_id" : "14a467a402904aee87de4028a8595493",
        "endpoint" : {
          "port" : "6031",
          "type" : "server",
          "name" : "container.server",
          "ipv4" : "127.0.0.1"
        },
        "name" : "GET",
        "parent" : "57fbd3ec12fe4912ba89e7a8eb97f2e7",
        "start_time" : 1400146616.554865,
        "trace_id" : "d7ff028674c5471e94b964ec37d35546",
        "end_time" : 1400146616.559608,
        "annotations" : [
          { "type" : "string", "value" : "/sdb1/347/TEMPAUTH_test/summit", "key" : "request_path", "event" : "sr" },
          { "type" : "string", "value" : "200 OK", "key" : "return_code", "event" : "ss" }
        ]
      }
      {
        "_id" : "57fbd3ec12fe4912ba89e7a8eb97f2e7",
        "endpoint" : {
          "port" : "8080",
          "type" : "server",
          "name" : "proxy.server",
          "ipv4" : "127.0.0.1"
        },
        "name" : "GET",
        "parent" : "5602ca4010fe420c9fa56528faf711ab",
        "start_time" : 1400146616.490691,
        "trace_id" : "d7ff028674c5471e94b964ec37d35546",
        "end_time" : 1400146616.58012,
        "annotations" : [
          { "type" : "string", "value" : "/v1/TEMPAUTH_test/summit", "key" : "request_path", "event" : "sr" },
          { "type" : "string", "value" : "200 OK", "key" : "return_code", "event" : "ss" }
        ]
      }
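As a worked reading of the two documents above, the sketch below loads a trimmed copy of them (ids, parent, endpoint name, and timestamps only) and derives each span's duration and its parent within the trace. The field names and values are copied from the slide; the helper names are assumptions.

```python
import json

# Trimmed copies of the two span documents shown on the slide.
SPANS_JSON = """
[
  {"_id": "14a467a402904aee87de4028a8595493",
   "endpoint": {"name": "container.server", "port": "6031"},
   "name": "GET",
   "parent": "57fbd3ec12fe4912ba89e7a8eb97f2e7",
   "trace_id": "d7ff028674c5471e94b964ec37d35546",
   "start_time": 1400146616.554865,
   "end_time": 1400146616.559608},
  {"_id": "57fbd3ec12fe4912ba89e7a8eb97f2e7",
   "endpoint": {"name": "proxy.server", "port": "8080"},
   "name": "GET",
   "parent": "5602ca4010fe420c9fa56528faf711ab",
   "trace_id": "d7ff028674c5471e94b964ec37d35546",
   "start_time": 1400146616.490691,
   "end_time": 1400146616.58012}
]
"""

spans = json.loads(SPANS_JSON)
by_id = {span["_id"]: span for span in spans}


def duration_ms(span):
    """Wall-clock duration of a span in milliseconds."""
    return (span["end_time"] - span["start_time"]) * 1000.0


def parent_of(span):
    """Return the parent span if it is present in this data set."""
    return by_id.get(span["parent"])


for span in spans:
    parent = parent_of(span)
    parent_name = parent["endpoint"]["name"] if parent else "(not in this set)"
    print(span["endpoint"]["name"], round(duration_ms(span), 3), "ms, parent:", parent_name)
```

With these values the container server span lasts roughly 4.7 ms inside a proxy server span of roughly 89 ms, so most of the time in this request is spent outside the container server call.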
  17. Query and analysis tools
      • Query
        – Query trace data by trace_id or span_id, order or range by time, group by nodes or annotation keys
      • Trace timeline
        – Plot the spans on the timeline with causal relationships
      • Diagnose
        – Analyze the critical path for a successful response
        – Identify the failure point in the path
      • Simulation
        – Replay the recorded processing of the requests
      • Data Mining
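A minimal in-memory sketch of the query and diagnose ideas follows. The span records are hypothetical and flattened (the endpoint is a plain string and the return code a top-level field, unlike the nested data model), and "the failing span that finished first" is only one possible heuristic for locating a failure point.

```python
# Hypothetical, flattened span records for two traces.
SPANS = [
    {"trace_id": "t1", "span_id": 1, "endpoint": "proxy.server",
     "start_time": 0.000, "end_time": 0.200, "return_code": "503 Service Unavailable"},
    {"trace_id": "t1", "span_id": 2, "endpoint": "container.server",
     "start_time": 0.020, "end_time": 0.060, "return_code": "201 Created"},
    {"trace_id": "t1", "span_id": 3, "endpoint": "container.server",
     "start_time": 0.025, "end_time": 0.150, "return_code": "507 Insufficient Storage"},
    {"trace_id": "t2", "span_id": 4, "endpoint": "proxy.server",
     "start_time": 1.000, "end_time": 1.050, "return_code": "200 OK"},
]


def query(spans, trace_id=None, endpoint=None):
    """Filter spans by trace id and/or endpoint, ordered by start time."""
    rows = [
        span for span in spans
        if (trace_id is None or span["trace_id"] == trace_id)
        and (endpoint is None or span["endpoint"] == endpoint)
    ]
    return sorted(rows, key=lambda span: span["start_time"])


def first_failure(spans):
    """Return the non-2xx span that finished earliest, or None.

    Heuristic: a failure deep in the call tree finishes before the callers
    that report it, so the earliest-finishing failure is a likely origin.
    """
    failed = [span for span in spans if not span["return_code"].startswith("2")]
    return min(failed, key=lambda span: span["end_time"]) if failed else None


trace = query(SPANS, trace_id="t1")
culprit = first_failure(trace)
print([span["span_id"] for span in trace], culprit["span_id"])
```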
  18. Reference
      • Google Dapper: a large-scale distributed systems tracing infrastructure
      • Twitter Zipkin: a distributed tracing system that helps gather timing data for all the disparate services at Twitter
      • Berkeley X-Trace: a pervasive network tracing framework
  19. Demo
  20. Q&A