A microservices architecture gives rise to a range of benefits, including independent scaling and independent deployment of services. However, it also introduces challenges around configuration management, load balancing, and latency analysis. Reshmi Krishna discusses how companies like Twitter analyze microservice latency in real time and demonstrates how to integrate popular distributed tracing tools like Zipkin into an existing application with just a few lines of code. At the end, we will also see a demo of the tracing capabilities in PCF Metrics.
13. Troubleshooting Latency Issues
When was the event? How long did it take?
How do I know it was slow?
Why did it take so long?
Which microservice was responsible?
14. Distributed Tracing
Distributed Tracing is the process of collecting end-to-end transaction graphs in near real time
A trace represents the entire journey of a request
A span represents a single operation call
Distributed tracing systems are often used for this purpose; Zipkin is one example
As a request flows from one microservice to another, tracers add logic to create unique trace and span IDs
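To make this concrete, here is a minimal, hand-written Java sketch of what a tracer does under the hood, assuming Zipkin's B3 HTTP headers (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId). In practice an instrumentation library such as Spring Cloud Sleuth injects these headers for you; the class and method names below are invented for the illustration.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.ThreadLocalRandom;

// Illustration only: manual propagation of Zipkin B3 headers between services.
// Real tracers (e.g. Spring Cloud Sleuth) do this automatically.
public class B3Propagation {

    // 64-bit IDs rendered as 16-character lower-hex strings.
    static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    // Call a downstream service: reuse the incoming trace ID (or mint one for
    // the very first request) and create a fresh span ID for this hop.
    static void callDownstream(String incomingTraceId, String incomingSpanId,
                               String downstreamUrl) throws Exception {
        String traceId = (incomingTraceId != null) ? incomingTraceId : newId();
        String spanId = newId();

        HttpURLConnection conn =
                (HttpURLConnection) new URL(downstreamUrl).openConnection();
        conn.setRequestProperty("X-B3-TraceId", traceId);
        conn.setRequestProperty("X-B3-SpanId", spanId);
        if (incomingSpanId != null) {
            // The caller's span becomes the parent of the new span.
            conn.setRequestProperty("X-B3-ParentSpanId", incomingSpanId);
        }
        conn.getResponseCode(); // fire the request; a tracer would also record timing here
    }
}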
15. Tracers
Tracers add logic to create unique trace IDs
A trace ID is generated when the first request is made
A span ID is generated as the request arrives at each microservice
An example tracer is Spring Cloud Sleuth
Tracers execute in your production apps! They are written not to log too much
Tracers have an instrumentation or sampling policy to manage the volume of traces and spans
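As a rough sketch of how small the integration is, the snippet below assumes Spring Cloud Sleuth 1.x (current at the time of this talk) with spring-cloud-starter-sleuth and spring-cloud-sleuth-zipkin on the classpath; the Sampler and AlwaysSampler types come from that release line, and later versions renamed the sampling classes and properties. With Sleuth on the classpath, log lines typically gain a [appname,traceId,spanId,exportable] prefix with no further code.

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.sleuth.Sampler;
import org.springframework.cloud.sleuth.sampler.AlwaysSampler;
import org.springframework.context.annotation.Bean;

// Sketch: a Spring Boot app with Sleuth 1.x on the classpath. The only code
// added for tracing is the sampler bean, which overrides the default
// sampling policy so that every span is exported to Zipkin.
@SpringBootApplication
public class TracedApplication {

    @Bean
    public Sampler defaultSampler() {
        // Export every span; in production you would normally sample only a
        // percentage of requests to keep the tracer's overhead low.
        return new AlwaysSampler();
    }

    public static void main(String[] args) {
        SpringApplication.run(TracedApplication.class, args);
    }
}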
16. Visualization - Traces & Spans
service1: Trace Id 1, Span Id 1
service2: Trace Id 1, Parent Id 1, Span Id 2
service3: Trace Id 1, Parent Id 2, Span Id 3
service4: Trace Id 1, Parent Id 2, Span Id 4
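Purely as an illustration (not Zipkin's actual code), this plain-Java sketch rebuilds the trace tree from the IDs on this slide: each span carries the shared trace ID, its own span ID, and the span ID of its parent.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: rebuilding the trace tree shown on this slide.
public class TraceTree {

    static class Span {
        final String service;
        final int traceId;      // shared by every span in the trace
        final Integer parentId; // null for the root span
        final int spanId;
        Span(String service, int traceId, Integer parentId, int spanId) {
            this.service = service; this.traceId = traceId;
            this.parentId = parentId; this.spanId = spanId;
        }
    }

    public static void main(String[] args) {
        // The four spans from the slide: all share trace ID 1.
        List<Span> spans = Arrays.asList(
                new Span("service1", 1, null, 1),
                new Span("service2", 1, 1, 2),
                new Span("service3", 1, 2, 3),
                new Span("service4", 1, 2, 4));

        // Group spans under their parent span ID.
        Map<Integer, List<Span>> children = new HashMap<>();
        for (Span s : spans) {
            children.computeIfAbsent(s.parentId, k -> new ArrayList<>()).add(s);
        }

        // Walk the tree from the root (the span without a parent).
        print(children.get(null).get(0), children, "");
        // Prints:
        // service1 (span 1)
        //   service2 (span 2)
        //     service3 (span 3)
        //     service4 (span 4)
    }

    static void print(Span span, Map<Integer, List<Span>> children, String indent) {
        System.out.println(indent + span.service + " (span " + span.spanId + ")");
        for (Span child : children.getOrDefault(span.spanId, new ArrayList<Span>())) {
            print(child, children, indent + "  ");
        }
    }
}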
17. Dapper Paper By Google
This paper describes Dapper, which is Google's production distributed systems tracing infrastructure
Design Goals:
Low overhead
Application-level transparency
Scalability
18. Zipkin
Zipkin is a distributed tracing system
Its implementation is based on Google's Dapper paper
Aggregates spans into trace trees
Manages both collection and lookup of the data
In 2015, OpenZipkin became the primary fork
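As a small, hedged example of the "lookup" side: assuming a Zipkin server on its default port 9411 and the v1 HTTP query API that shipped around the time of this talk (endpoint and parameter names differ in newer Zipkin versions), recent traces for a service can be fetched like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: query a locally running Zipkin server for traces involving
// "service1". Assumes the Zipkin v1 API; adjust the path for newer versions.
public class ZipkinLookup {
    public static void main(String[] args) throws Exception {
        URL url = new URL(
                "http://localhost:9411/api/v1/traces?serviceName=service1&limit=10");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                // The response is a JSON array of traces, each a list of spans.
                System.out.println(line);
            }
        }
    }
}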
20. Demo: Architecture Diagram
[Architecture diagram: four apps, each instrumented with Spring Cloud Sleuth, send spans over a transport (MQ/HTTP/log) to the Zipkin collector; the collector writes them to the span store, and the query server and Zipkin UI read traces back out.]
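A minimal sketch of how one of the APP boxes above could point its Sleuth instrumentation at the Zipkin collector over HTTP, assuming Spring Cloud Sleuth 1.x with spring-cloud-sleuth-zipkin on the classpath (spring.zipkin.baseUrl is that version's property; in the demo the same setting could live in application properties, and an MQ transport is also possible via Sleuth's stream module):

import java.util.Collections;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// Sketch: an instrumented app that reports spans to a Zipkin collector
// listening on localhost:9411 over HTTP. Sleuth itself needs no code here;
// the property only tells it where the collector lives.
@SpringBootApplication
public class DemoApp {
    public static void main(String[] args) {
        SpringApplication app = new SpringApplication(DemoApp.class);
        app.setDefaultProperties(Collections.singletonMap(
                "spring.zipkin.baseUrl", "http://localhost:9411"));
        app.run(args);
    }
}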
22. Links
Dapper, Google : http://research.google.com/pubs/pub36356.html
Code for this presentation : https://github.com/reshmik/DistributedTracingDemo_Velocity2016.git
Sleuth’s documentation: http://cloud.spring.io/spring-cloud-sleuth/spring-cloud-sleuth.html
Repo with Spring Boot Zipkin server: https://github.com/openzipkin/zipkin-java
Zipkin deployed as a PCF app : https://github.com/spring-cloud-samples/sleuth-documentation-apps/tree/master/zipkin-server
Pivotal Web Services trial : https://run.pivotal.io/
Pivotal Cloud Foundry on your laptop : https://docs.pivotal.io/pcf-dev/
@reshmi9k
Editor's Notes
A monolith usually looks like a big ball of mud with entangled dependencies, lack of cohesion, and direct DB queries instead of using interfaces and APIs. It does NOT do one thing very well. It usually does a lot of things, which become brittle and difficult to reason about.
All functionality must be deployed together
No Language and framework heterogeneity
A failure is more likely to cascade, reducing resilience - brittle - high-risk deployments
Scale vertically or limited horizontal scaling of everything at once
Large team - anti agile
Harder to reuse
Harder to modify - thousands of lines of hard to understand code
Harder to replace - mean time to recovery is limited
Harder for new team members to get up to speed
Wikipedia: A big ball of mud is a software system that lacks a perceivable architecture. Although undesirable from a software engineering point of view, such systems are common in practice due to business pressures, developer turnover and code entropy. They are a type of design anti-pattern.
Death Star architecture by Adrian Cockcroft
As visualized by App Dynamics, Boundary.com and Twitter internal tools
A trace represents the entire journey of a request
A span is a basic unit of work
A span is identified by a unique 64-bit ID
A trace is identified by another 64-bit ID, which every span in that trace carries
A span contains timestamped records, any RPC timing data, and zero or more application-specific annotations
The trace gives you the structure through which you can identify your calls. You can think of a trace as a tree and of the tree nodes as spans.
The edges indicate a causal relationship between a span and its parent span. Independent of its place in a larger trace tree, though, a span is also a simple log of timestamped records which encode the span's start and end time, any RPC timing data, and zero or more application-specific annotations
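To make those timestamped records concrete: Zipkin's classic RPC annotations are cs (client send), sr (server receive), ss (server send) and cr (client receive), and latency breakdowns fall out of simple subtraction. The timestamps below are invented for the example (Zipkin records them in microseconds).

import java.util.LinkedHashMap;
import java.util.Map;

// Illustration: the four classic RPC annotations on one span and what can be
// derived from them. The numbers are made up.
public class SpanTimings {
    public static void main(String[] args) {
        Map<String, Long> annotations = new LinkedHashMap<>();
        annotations.put("cs", 0L);    // client send: the caller issues the request
        annotations.put("sr", 30L);   // server receive: request arrives at the service
        annotations.put("ss", 250L);  // server send: the service responds
        annotations.put("cr", 280L);  // client receive: the caller gets the response

        long total = annotations.get("cr") - annotations.get("cs");   // latency the caller saw
        long server = annotations.get("ss") - annotations.get("sr");  // time spent in the service
        long network = total - server;                                // time spent on the wire

        System.out.println("total=" + total + " server=" + server + " network=" + network);
    }
}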
Dapper was published in 2010
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36356.pdf
Zipkin is a distributed tracing system. It helps gather timing data needed to troubleshoot latency problems in microservice architectures. It manages both the collection and lookup of this data. Zipkin’s design is based on the Google Dapper paper.
Started as a project in Twitter's first Hack Week.
The initial version implemented the Dapper paper's design for Thrift requests.
Today it has grown to include support for tracing HTTP, Thrift, Memcache, SQL and Redis requests.
The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.
Tracers collect timing data and transport it over HTTP or Kafka. We use Scribe to transport all the traces from the different services to Zipkin and Hadoop. Scribe was developed by Facebook and it’s made up of a daemon that can run on each server in your system. It listens for log messages and routes them to the correct receiver depending on the category.
Once the trace data arrives at the Zipkin collector daemon we check that it's valid, store it, and index it for lookups.
Zipkin was originally built with Cassandra for storage. It was scalable, had a flexible schema, and is heavily used within Twitter. However, this component is now pluggable, and now we have support for Redis, HBase, MySQL, PostgreSQL, SQLite, and H2.
Users query for traces via Zipkin's web UI or API.