Completing a transition to a microservices-based architecture makes every software engineer feel good. You can be proud of requests spanning multiple individual services, each with an isolated, single responsibility. Exactly as you dreamed it would be.
In the course of this transition, however, you will also have created several new problems. Among these is a whole new level of complexity in understanding the behavior of the application when troubleshooting a problem. If you have ever wrestled with pinpointing the exact root cause during a post-mortem, this talk is for you.
We will show you how to capture runtime transparency in a distributed and dynamic architecture. Better yet, we will cover both simple and advanced examples of how taking this route gives you an objective, evidence-based ability to zoom in on the problem.
After attending the talk you will understand how distributed tracing will help your team during incident response and post-mortems.
Register today to learn more:
What are distributed traces
Different ways to add distributed tracing to your production services
How distributed traces expose the runtime architecture of your microservices in production
Examples of how a distributed trace highlights a problem
Advanced examples of how distributed traces map root causes to real user impact
2. What are we going to cover today?
Understanding the need for distributed traces and the general concepts
Examples of how distributed traces help you locate the root cause
Advanced examples of how distributed traces map root causes to real user impact
Different ways to add distributed tracing to your production services
Plumbr - sign up for your free trial at https://www.plumbr.io
3. How did we get to distributed services?
Software is eating the world
More and more major businesses and industries are being run on software and delivered as online services.
----- Marc Andreessen, 2011
4. Software is eating the world faster
Large companies are forced to take plays from start-ups’ playbooks to stay competitive. Enterprises are under pressure to innovate faster in order to stay in business.
----- McKinsey, 2019
5. Implications for the IT teams
Moving from monoliths to microservices to enable innovation in individual teams.
Adopting DevOps practices within IT to support faster innovation.
6. Distributed tracing – why bother?
7. Distributed tracing - why bother? Support tickets like this.
From: John
To: support@example.com
Subject: Cannot complete checkout
I just tried to complete the order #32828, but was unable to finish the checkout. Your app stalled for 20 seconds and then gave me an error.
8. … turning into this in two weeks
From: John
To: support@example.com
Subject: Re:Re:Re:Re:Re:Cannot complete the checkout
Finally managed to capture the HAR file from my browser using the instructions you altered. However, it is too big to be sent as an email attachment. Please advise.
10. What would such a trace look like?
11. Cornerstone of any distributed trace: UUID
Universally Unique Identifier (UUID)
• 128-bit number, randomly generated (122 random bits in a version 4 UUID)
• Requires no central coordinator
• For practical purposes, unique
• You are 460,000,000 times more likely to die from meteorite impact than to clash on UUIDs
68a9ab9d-f457-4dc8-98b0-645ef476fda6
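As a minimal sketch of the idea on this slide, Python's standard `uuid` module can mint such an identifier locally, with no coordinator involved. The helper name `new_trace_id` is an assumption made for illustration and is not part of any tracing library.

```python
import uuid


def new_trace_id() -> str:
    """Return a fresh random (version 4) UUID as a string.

    Hypothetical helper: the name is illustrative only.
    """
    return str(uuid.uuid4())


trace_id = new_trace_id()
parsed = uuid.UUID(trace_id)

print(trace_id)                        # a different value on every run
print(parsed.version)                  # 4, i.e. randomly generated
print(parsed.int.bit_length() <= 128)  # True: the value fits in 128 bits
```

The identifier shown on the slide parses the same way and is also a version 4 UUID.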
15. Passing the UUID: HTTP-headers
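A rough sketch of what passing the identifier in HTTP headers could look like is given here. The header names `X-Trace-Id` and `X-Parent-Span-Id` and both helper functions are assumptions made for illustration; real tracers such as Zipkin define their own header names. Headers are modelled as plain dictionaries rather than a real HTTP client.

```python
import uuid

# Hypothetical header names for illustration; real tracers define their own.
TRACE_HEADER = "X-Trace-Id"
PARENT_SPAN_HEADER = "X-Parent-Span-Id"


def start_span(incoming_headers: dict) -> dict:
    """Join the caller's trace if a trace ID arrived, otherwise start one."""
    trace_id = incoming_headers.get(TRACE_HEADER) or str(uuid.uuid4())
    return {
        "trace_id": trace_id,
        "span_id": str(uuid.uuid4()),
        "parent_span_id": incoming_headers.get(PARENT_SPAN_HEADER),
    }


def headers_for_downstream(span: dict) -> dict:
    """Headers to attach to an outgoing HTTP call made within this span."""
    return {TRACE_HEADER: span["trace_id"], PARENT_SPAN_HEADER: span["span_id"]}


# The frontend receives a request without tracing headers and starts a trace.
frontend = start_span({})
# The backend receives the frontend's outgoing headers and joins that trace.
backend = start_span(headers_for_downstream(frontend))

print(backend["trace_id"] == frontend["trace_id"])       # True
print(backend["parent_span_id"] == frontend["span_id"])  # True
```

HTTP header names are case-insensitive on the wire; the dictionary lookup here is case-sensitive only to keep the sketch short.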
21. Outcome: distributed trace
• Consisting of spans
• Registering the duration and outcome of the trace
• Enriched with additional metadata at span/trace level:
  • User ID
  • Cluster the span belongs to
  • Node ID of the span
  • …
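The data model listed above can be sketched as two small classes. This is only an illustration under stated assumptions: the class and field names, the millisecond timestamps, and the sample values are all made up here and do not reflect the schema of any particular tracing product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One unit of work done by one service on behalf of a trace."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    service: str
    start_ms: int
    end_ms: int
    failed: bool = False
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Trace:
    """All spans sharing one trace ID, plus trace-level metadata."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        # Earliest start to latest end across all spans.
        return max(s.end_ms for s in self.spans) - min(s.start_ms for s in self.spans)

    @property
    def failed(self) -> bool:
        return any(s.failed for s in self.spans)


# Made-up sample: a checkout that stalls for 20 seconds and then fails.
trace = Trace(trace_id="t-1", metadata={"user_id": "john"})
trace.spans.append(Span("t-1", "s-1", None, "frontend", 0, 20_000, failed=True,
                        metadata={"cluster": "eu-1", "node_id": "fe-03"}))
trace.spans.append(Span("t-1", "s-2", "s-1", "checkout", 120, 19_900, failed=True,
                        metadata={"cluster": "eu-1", "node_id": "co-07"}))

print(trace.duration_ms)  # 20000
print(trace.failed)       # True
```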
22. Summary: three building blocks for distributed tracing
23. Put distributed traces to good use
Removing the need to manually reproduce and gather evidence when responding to support tickets
Fully understanding the impact of user-facing issues
Prioritizing the improvements based on the impact on the end user
Proactively responding to issues via alerting based on the tracing information
24. Hypothetical support case landing on your desk
From: John
To: support@example.com
Subject: Cannot complete checkout
I just tried to complete the order #32828, but was unable to finish the checkout. Your app stalled for 20 seconds and then gave me an error.
25. … two weeks later
From: John
To: support@example.com
Subject: Re:Re:Re:Re:Re:Cannot complete the checkout
Finally managed to capture the HAR file from my browser using the instructions you altered. However, it is too big to be sent as an email attachment. Please advise.
26. What happened during the two weeks?
27. Could it have been different?
Yes. Let's walk through examples of how distributed tracing helps you by:
• Verifying the claim
• Prioritizing the response
• Understanding the true impact
• Proactively handling such problems
28. Example #1: verifying the complaint
29. Example #1: verifying the complaint
30. Example #1: complaint verified
Metadata added to the trace allowed us to search for the evidence
Spans linked to the trace allowed us to verify the failure had indeed occurred
31. Example #2: prioritizing the response
32. Example #2: prioritizing the response
33. Example #2: prioritizing the response
34. Example #2: priorities assigned based on the impact
Unique identification of an error coupled with distributed tracing allows you to objectively quantify the priority of a particular error.
In this specific situation, a high-priority response is likely not justified.
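One way to picture "objectively quantifying the priority" is to count the distinct users affected by each uniquely identified error. The sketch below assumes failed traces have already been reduced to (error identifier, user ID) pairs; the error names and users are made-up sample data, not output from any real system.

```python
from collections import defaultdict

# Made-up failed traces: (error identifier, user ID from trace metadata).
failed_traces = [
    ("checkout-npe", "john"),
    ("search-timeout", "alice"),
    ("search-timeout", "bob"),
    ("search-timeout", "alice"),
    ("search-timeout", "carol"),
]


def impact_by_error(traces):
    """Map each error identifier to the number of distinct affected users."""
    users = defaultdict(set)
    for error_id, user_id in traces:
        users[error_id].add(user_id)
    return {error_id: len(ids) for error_id, ids in users.items()}


impact = impact_by_error(failed_traces)
ranked = sorted(impact.items(), key=lambda item: item[1], reverse=True)
print(ranked)  # [('search-timeout', 3), ('checkout-npe', 1)]
```

In this sample the checkout error affects a single user, so it ranks below the error that affects three.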
35. Example #3: zooming out to see what real users experience
36. Example #3: zooming out to see what real users experience
37. Example #3: true impact only reveals itself if traces go all the way to the real user
Distributed tracing can and should leave the server rooms
End-to-end traces are the way to expose both the impact and root cause correctly
38. Example #4: becoming proactive
39. Example #4: becoming proactive
40. Example #4: do not rely upon end users. Harness the true power of distributed traces
Trigger alerts based on the impact
Send the alerts to channels in use
Respond to incidents using the root causes
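Impact-based alerting could be sketched roughly as follows. Everything here is an assumption made for illustration: the function name, the threshold, the root-cause labels, and the `notify` callback, which stands in for whatever chat or paging channel a team already uses.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple


def alert_on_impact(
    failed_traces: Iterable[Tuple[str, str]],
    min_users: int,
    notify: Callable[[str], None],
) -> None:
    """Notify once per root cause that affected at least `min_users` users."""
    users_by_cause: Dict[str, Set[str]] = defaultdict(set)
    for root_cause, user_id in failed_traces:
        users_by_cause[root_cause].add(user_id)
    for root_cause, users in users_by_cause.items():
        if len(users) >= min_users:
            notify(f"{root_cause}: {len(users)} users affected")


# Made-up data: one isolated failure and one widespread one.
traces = [("checkout-error", "john")]
traces += [("subscription-api-failure", f"user-{n}") for n in range(150)]

sent: List[str] = []  # stand-in for a chat or paging channel
alert_on_impact(traces, min_users=100, notify=sent.append)
print(sent)  # ['subscription-api-failure: 150 users affected']
```

Because the message names the root cause rather than the symptom, the responder can start from the failing component instead of from a user complaint.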
43. Capturing a trace with Zipkin: example
use Symfony\Component\HttpFoundation\Request;
use Zipkin\Propagation\Map;
use Zipkin\Timestamp;

/* create_tracing() is assumed to be a helper defined elsewhere in the application */
$tracing = create_tracing('php-frontend', '127.0.0.1');
$tracer = $tracing->getTracer();
/* Assumes the Symfony HttpFoundation Request class */
$request = Request::createFromGlobals();
/* Extract the context from HTTP headers */
$carrier = array_map(function ($header) {
    return $header[0];
}, $request->headers->all());
$extractor = $tracing->getPropagation()->getExtractor(new Map());
$extractedContext = $extractor($carrier);
/* Create a span and set its attributes */
$span = $tracer->newChild($extractedContext);
$span->start(Timestamp\now());
$span->setName('parse_request');
$span->setKind(Zipkin\Kind\SERVER);
44. Capturing a trace with Zipkin: example
45. Open-source (OS) solutions: flexible but obtrusive
• You can tailor the metadata and model to match your specific needs
• As a result, your application code is now dependent on the framework
• In addition, there is the human factor: if you forget to instrument a particular endpoint, it will be missing from traces
• Usability-wise, there are limited ways to query and visualize the data
47. Capturing a trace with Plumbr: example
$ java -javaagent:/path/to/plumbr.jar com.example.YourExecutable
48. Capturing a trace with Plumbr: example
49. Commercial solutions: they come at a cost but do the heavy lifting for you
• Installation is easy
• No dependencies at source code level
• Fewer nuances to deal with
50. Tying it together
You now understand how distributed tracing works
You got a sneak peek into how different open-source and commercial solutions can help you capture distributed traces
You are equipped with examples of how distributed tracing lets you answer hard questions with simple answers
51. And of course, when you start your journey with distributed tracing …
52. … Plumbr will be the solution to consider
53. We integrate with your existing ecosystem
54. And all the information exposed is based on the distributed traces
55. Thank you!
Ivo Mägi, CEO & product manager @ Plumbr
Editor's Notes
Highlight the cost of downtime, and perhaps the growing number of businesses relying on digital channels.
Remember the old days, when the entire application under management consisted of one big box? Well, in reality you most likely had a few of those running in a load-balanced cluster, but every node was identical.
Now, instead of a few stable services, you need to govern hundreds of fast-changing microservices. As a result, services break more frequently. To give you some idea: if every service is 99% available and a request depends on 30 microservices, the end-to-end availability drops to about 74%.
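The 74% figure follows from multiplying the individual availabilities. A quick check, assuming the request needs all 30 services and their failures are independent (the function name is illustrative):

```python
def end_to_end_availability(per_service: float, services: int) -> float:
    """Availability of a request that needs every service to be up,
    assuming the services fail independently of each other."""
    return per_service ** services


print(f"{end_to_end_availability(0.99, 30):.0%}")  # 74%
```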
In order to help you fully comprehend and appreciate distributed tracing, let’s dive into a few details about what constitutes a trace.
A trace is the complete processing of a request. The trace represents the whole journey of a request as it moves through all of the services or components of a distributed system. All trace events generated by a request share a trace ID that tools use to organize, filter, and search for specific traces.
Distributed traces help IT and DevOps teams to monitor applications, especially those composed of microservices. Distributed tracing helps pinpoint where failures occur and what causes suboptimal performance.
Can we somehow animate this process in a simple way? On top of the microservices picture?
A request comes in from the top
At the first node, an ID is created (something automatic, like sdv0894vöeb8sv) and the start time is registered
The request moves on to the second node
At the second node, the ID is the same and the start time is registered
The request moves on to the third node
At the third node: the same ID, plus the start and end times
The request moves back to the second node, and the second node gets its end time
It moves back to the first node, and the first node gets its end time
It exits at the top
From all nodes, the information flows to the central monitoring server
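The flow in these notes can be mimicked in a few lines. This is a toy simulation under stated assumptions: nested function calls stand in for network hops, a plain list stands in for the central monitoring server, and the node names are made up.

```python
import time
import uuid

collector = []  # stand-in for the central monitoring server


def handle(node_names, trace_id=None):
    """Process a request at node_names[0], then call the remaining nodes."""
    trace_id = trace_id or uuid.uuid4().hex  # the first node creates the ID
    start = time.monotonic()                 # register the start time
    if len(node_names) > 1:
        handle(node_names[1:], trace_id)     # pass the same ID downstream
    end = time.monotonic()                   # register the end time on return
    collector.append(
        {"trace_id": trace_id, "node": node_names[0], "start": start, "end": end}
    )
    return trace_id


trace_id = handle(["node-1", "node-2", "node-3"])

# The innermost node finishes first, so it reports first.
print([span["node"] for span in collector])  # ['node-3', 'node-2', 'node-1']
print(all(span["trace_id"] == trace_id for span in collector))  # True
```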
The last piece of all tracing infrastructure is the monitoring agents themselves.
Monitoring agents are a work of software craft by themselves.
The common denominator among all web applications is HTTP or HTTPS traffic.
Therefore, it is common practice to have agents that operate at the lowest levels so that they can capture the complete details of traffic between all the nodes in an application.
Agents must be able to capture and analyze traffic in a manner that is agnostic to languages, frameworks, and other infrastructure.
Such agents are either built using language-specific APIs at the bytecode level (such as Java or .NET agents) or dig deeper and hook into system library calls via LD_PRELOAD at the native code level.
So we covered the concept. Distributed tracing builds up a data model, consisting of traces and spans which are uniquely identified and contain valuable metadata. This data is captured by agents, deployed per microservice under monitoring. The data is sent to the central server where it is processed and made available for querying and visualization.
I bet your mind is already racing a million miles a minute, thinking about all the cool things that can be done given such information, right? Let me show you three examples, going from a simple and straightforward use case to something I bet you never even thought about:
Just as easily, you are now able to confirm that the customer complaint is real - evidence is right in front of you. No more trying to reproduce or gathering additional evidence. You see the failure right in front of you. So you can confirm the presence of the issue and proceed with fixing the bug right away. Right?
Hm. Now you still have the evidence right in front of you. John Smith indeed was unable to complete the checkout, but apparently he was the only one experiencing this error. Should you really spend your time on this issue, considering all the other bugs and features waiting for your attention in the backlog?
Whoa. So it was not the checkout all along. The subscription details were the culprit. Apparently the failure to fetch subscription details was not handled properly and thus a non-existent subscription ID was passed to the checkout. But the subscription detail API has been failing for hours now, for hundreds of users. Houston, P1!
Like every chef, I happen to be proud of my own menu. Plumbr APM and RUM solutions are especially good at doing everything we described today, and more. If you were inspired, go and grab your free trial and check out how we can change the way you handle availability and performance issues in production.