How to Properly Blame Things for Causing Latency

SpringOne Platform 2016
Speaker: Adrian Cole; Software Engineer, Pivotal

Latency analysis is the act of blaming components for causing user-perceptible delay. In today's world of microservices, this can be tricky as requests can fan out across polyglot components and even data-centers. In many cases, the root source of latency isn't a component, but rather a link between components.

This session will overview how to debug latency problems, using call graphs created by Zipkin. We'll trace zipkin itself, setting it up from scratch using docker. While we're at it, we'll discuss how the model works, and how to safely trace production. Finally, we'll overview the ecosystem, including tools to trace ruby, c#, java and spring boot apps.

When you leave, you'll at least know something about distributed tracing, and hopefully be on your way to blaming things for causing latency!


How to Properly Blame Things for Causing Latency

  1. © 2016 Pivotal An introduction to Distributed Tracing and Zipkin Adrian Cole, Pivotal @adrianfcole How to Properly Blame Things for Causing Latency
  2. Introduction introduction latency analysis distributed tracing zipkin demo wrapping up @adrianfcole #zipkin
  3. @adrianfcole • spring cloud at pivotal • focused on distributed tracing • helped open zipkin
  4. Latency Analysis introduction latency analysis distributed tracing zipkin demo wrapping up @adrianfcole #zipkin
  5. Latency Analysis Microservice and data pipeline architectures are often a graph of components, distributed across a network. A call graph or data flow can become delayed or fail due to the nature of the operation, components, or edges between them. We want to understand our current architecture and troubleshoot latency problems, in production.
  6. Why is POST /things slow? POST /things
  7. When was the event and how long did it take? First log statement was at 15:31:29.103 GMT… last… 15:31:30.530 Server Received: 15:31:29.103 POST /things Server Sent: 15:31:30.530 Duration: 1427 milliseconds
  8. wombats:10.2.3.47:8080 Server log says Client IP was 1.2.3.4 This is a shard in the wombats cluster, listening on 10.2.3.47:8080 Server Received: 15:31:29.103 POST /things Server Sent: 15:31:30.530 Duration: 1427 milliseconds Where did this happen? peer.ipv4 1.2.3.4
  9. wombats:10.2.3.47:8080 Which event was it? The http response header had “request-id: abcd-ffe”? Is that what you mean? Server Received: 15:31:29.103 POST /things Server Sent: 15:31:30.530 Duration: 1427 milliseconds peer.ipv4 1.2.3.4 http.request-id abcd-ffe
  10. wombats:10.2.3.47:8080 Is it abnormal? I’ll check other logs for this request id and see what I can find out. Server Received: 15:31:29.103 POST /things Server Sent: 15:31:30.530 Duration: 1427 milliseconds Well, average response time for POST /things in the last 2 days is 100ms peer.ipv4 1.2.3.4 http.request-id abcd-ffe
  11. wombats:10.2.3.47:8080 Achieving understanding I searched the logs for others in that group… took about the same time. Server Received: 15:31:29.103 POST /things Server Sent: 15:31:30.530 Duration: 1427 milliseconds Ok, looks like this client is in the experimental group for HD uploads peer.ipv4 1.2.3.4 http.request-id abcd-ffe http.request.size 15 MiB http.url …&features=HD-uploads
  12. POST /things There’s often two sides to the story Client Sent: 15:31:28.500 Client Received: 15:31:31.000 Duration: 2500 milliseconds Server Received: 15:31:29.103 POST /things Server Sent: 15:31:30.530 Duration: 1427 milliseconds
  13. and not all operations are on the critical path Wire Send Store Async Store Wire Send POST /things POST /things
  14. and not all operations are relevant Wire Send Store Async Async Store Failed Wire Send POST /things POST /things KQueueArrayWrapper.kev UnboundedFuturePool-2 SelectorUtil.select LockSupport.parkNan ReferenceQueue.remove
  15. Service architecture isn’t this simple anymore Single-server scenarios aren’t realistic or don’t fully explain latency. (Image: David Vignoni, Gnome-fs-server.svg)
  16. Can we make troubleshooting wizard-free? We no longer need wizards to deploy complex architectures. We shouldn’t need wizards to troubleshoot them, either!
  17. Distributed Tracing introduction latency analysis distributed tracing zipkin demo wrapping up @adrianfcole #zipkin
  18. Distributed Tracing commoditizes knowledge Distributed tracing systems collect end-to-end latency graphs (traces) in near real-time. You can compare traces to understand why certain requests take longer than others.
  19. Distributed Tracing Vocabulary A Span is an individual operation that took place. A span contains timestamped events and tags. A Trace is an end-to-end latency graph, composed of spans.
  20. wombats:10.2.3.47:8080 A Span is an individual operation Server Received POST /things Server Sent Events Tags Operation peer.ipv4 1.2.3.4 http.request-id abcd-ffe http.request.size 15 MiB http.url …&features=HD-uploads
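     To make the vocabulary concrete, below is a minimal sketch of that model in plain Java. The types and field names are hypothetical (they follow the vocabulary above, not Zipkin's actual API or wire format): a span carries shared trace identifiers, an operation name, timestamped events and key/value tags.

     import java.util.LinkedHashMap;
     import java.util.Map;

     // Illustrative only, not Zipkin's API. A span names one operation and holds
     // the identifiers that tie it to the rest of the trace.
     class Span {
       final String traceId;   // shared by every span in the same trace
       final String id;        // identifies this individual operation
       final String parentId;  // the span that caused this one; null for the root
       final String name;      // the operation, e.g. "post /things"
       final Map<Long, String> events = new LinkedHashMap<>(); // epoch micros -> "Server Received", "Server Sent"
       final Map<String, String> tags = new LinkedHashMap<>(); // e.g. "peer.ipv4" -> "1.2.3.4"

       Span(String traceId, String id, String parentId, String name) {
         this.traceId = traceId;
         this.id = id;
         this.parentId = parentId;
         this.name = name;
       }
     }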
  21. Tracing is logging important events Wire Send Store Async Store Wire Send POST /things POST /things
  22. Tracers record time, duration and host Wire Send Store Async Store Wire Send POST /things POST /things
  23. Tracers send trace data out of process Tracers propagate IDs in-band, to tell the receiver there’s a trace in progress. Completed spans are reported out-of-band, to reduce overhead and allow for batching.
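     As an illustration of the in-band part, Zipkin-style tracers typically propagate B3 headers on each outgoing request. The sketch below hand-rolls that for one call; in practice a tracer library (e.g. Brave or Spring Cloud Sleuth) does this for you, and the host and ID values here are made up.

     import java.net.HttpURLConnection;
     import java.net.URL;

     class B3PropagationSketch {
       public static void main(String[] args) throws Exception {
         // Hypothetical downstream service; only the header names matter here.
         URL url = new URL("http://things-service:8080/things");
         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
         conn.setRequestMethod("POST");
         conn.setRequestProperty("X-B3-TraceId", "463ac35c9f6413ad48485a3953bb6124"); // same for every span in the trace
         conn.setRequestProperty("X-B3-SpanId", "a2fb4a1d1a96d312");                  // the client span making this call
         conn.setRequestProperty("X-B3-ParentSpanId", "0020000000000001");            // the span that caused this one
         conn.setRequestProperty("X-B3-Sampled", "1");                                // 1 = record this trace, 0 = don't
         conn.setDoOutput(true);
         conn.getOutputStream().close();                                              // send an empty request body
         System.out.println("HTTP " + conn.getResponseCode());                        // receiver continues the same trace
       }
     }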
  24. Tracers usually live in your application Tracers execute in your production apps! They are written to not log too much, and to not cause applications to crash. - propagate structural data in-band, and the rest out-of-band - have instrumentation or sampling policy to manage volume - often include opinionated instrumentation of layers such as HTTP
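     The sampling policy mentioned above is often just a trace-scoped percentage: the decision is made once at the root span and then travels with the trace IDs (e.g. via X-B3-Sampled), so every service keeps or drops the same trace. A hypothetical sketch, not Zipkin's sampler API:

     import java.util.concurrent.ThreadLocalRandom;

     // Keeps roughly `rate` of all traces; downstream services reuse the propagated
     // decision instead of flipping their own coin.
     class PercentageSampler {
       private final float rate; // e.g. 0.1f records about 10% of new traces

       PercentageSampler(float rate) {
         this.rate = rate;
       }

       boolean isSampled() {
         return ThreadLocalRandom.current().nextFloat() < rate;
       }
     }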
  25. Tracing Systems are Observability Tools Tracing systems collect, process and present data reported by tracers. - aggregate spans into trace trees - provide query and visualization for latency analysis - have retention policy (usually days)
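     The query side is scriptable over HTTP as well as usable from the UI. A sketch under the assumption of a local Zipkin 1.x server and its /api/v1/traces endpoint; the service name and limit parameters are illustrative.

     import java.io.BufferedReader;
     import java.io.InputStreamReader;
     import java.net.HttpURLConnection;
     import java.net.URL;

     class QueryTracesSketch {
       public static void main(String[] args) throws Exception {
         // Assumes a Zipkin server on localhost:9411, like the single-jar example later in this deck.
         URL url = new URL("http://localhost:9411/api/v1/traces?serviceName=frontend&limit=10");
         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
         try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
           String line;
           while ((line = in.readLine()) != null) {
             System.out.println(line); // raw JSON: a list of traces, each a list of spans
           }
         }
       }
     }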
  26. Tracing is not just for latency Some wins unrelated to latency - Understand your architecture - Find services that aren’t used - Reduce time spent on triage
  27. Zipkin introduction latency analysis distributed tracing zipkin demo wrapping up @adrianfcole #zipkin
  28. Zipkin is a distributed tracing system
  29. Zipkin has pluggable architecture Tracers report spans via HTTP or Kafka. Servers collect spans, storing them in MySQL, Cassandra, or Elasticsearch. Users query for traces via Zipkin’s Web UI or API.
     services:
       storage:
         image: openzipkin/zipkin-cassandra:1.6.0
         container_name: cassandra
         ports:
           - 9042:9042
       server:
         image: openzipkin/zipkin:1.6.0
         environment:
           - STORAGE_TYPE=cassandra
           - CASSANDRA_CONTACT_POINTS=cassandra
         ports:
           - 9411:9411
         depends_on:
           - storage
  30. Zipkin has starter architecture Tracing is new for a lot of folks. For many, the MySQL option is a good start, as it is familiar.
     services:
       storage:
         image: openzipkin/zipkin-mysql:1.6.0
         container_name: mysql
         ports:
           - 3306:3306
       server:
         image: openzipkin/zipkin:1.6.0
         environment:
           - STORAGE_TYPE=mysql
           - MYSQL_HOST=mysql
         ports:
           - 9411:9411
         depends_on:
           - storage
  31. Zipkin can be as simple as a single file
     $ curl -SL 'https://search.maven.org/remote_content?g=io.zipkin.java&a=zipkin-server&v=LATEST&c=exec' > zipkin.jar
     $ SELF_TRACING_ENABLED=true java -jar zipkin.jar
     (Spring Boot banner) :: Spring Boot :: (v1.4.0.RELEASE)
     2016-08-01 18:50:07.098 INFO 8526 --- [ main] zipkin.server.ZipkinServer : Starting ZipkinServer on acole with PID 8526 (/Users/acole/oss/sleuth-webmvc-example/zipkin.jar started by acole in /Users/acole/oss/sleuth-webmvc-example)
     —snip—
     $ curl -s localhost:9411/api/v1/services | jq .
     [
       "zipkin-server"
     ]
  32. Zipkin lives in GitHub Zipkin was created by Twitter in 2012. In 2015, OpenZipkin became the primary fork. OpenZipkin is an org on GitHub. It contains tracers, an OpenAPI spec, service components and docker images. https://github.com/openzipkin
  33. Demo introduction latency analysis distributed tracing zipkin demo wrapping up @adrianfcole #zipkin
  34. Two Spring Boot (Java) services collaborate over http. Zipkin will show how long the whole operation took, as well as how much time was spent in each service. https://github.com/adriancole/sleuth-webmvc-example Distributed Tracing across Spring Boot apps
  35. Web requests in the demo are served by Spring MVC controllers. Tracing of these is performed automatically by Spring Cloud Sleuth. Spring Cloud Sleuth reports to Zipkin via HTTP by depending on spring-cloud-sleuth-zipkin. https://cloud.spring.io/spring-cloud-sleuth/ Spring Cloud Sleuth Java
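     A minimal sketch of such a controller (illustrative names, not the exact demo code): with spring-cloud-sleuth-zipkin on the classpath, Sleuth traces each request to it and reports the spans to a Zipkin server, by default one listening on localhost:9411.

     import org.springframework.boot.SpringApplication;
     import org.springframework.boot.autoconfigure.SpringBootApplication;
     import org.springframework.web.bind.annotation.PostMapping;
     import org.springframework.web.bind.annotation.RestController;

     @SpringBootApplication
     @RestController
     public class ThingsService {

       // Sleuth joins or starts a trace before this handler runs and reports the span afterwards.
       @PostMapping("/things")
       public String postThings() {
         return "stored";
       }

       public static void main(String[] args) {
         SpringApplication.run(ThingsService.class, args);
       }
     }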
  36. Wrapping Up introduction latency analysis distributed tracing zipkin demo wrapping up @adrianfcole #zipkin
  37. Wrapping up Start by sending traces directly to a zipkin server. Grow into fanciness as you need it: sampling, streaming, etc. Remember you are not alone! @adrianfcole #zipkin gitter.im/spring-cloud/spring-cloud-sleuth gitter.im/openzipkin/zipkin
