
OSDC 2018 | Distributed Monitoring by Gianluca Arbezzano


Modern software development is increasingly taking a “microservice” approach that has resulted in an explosion of complexity at the network level. We have more applications running distributed across different datacenters. Distributed tracing, events, and metrics are essential for observing and understanding modern microservice architectures.
This talk is a deep dive into how to monitor your distributed system. You will get tools, methodologies, and experiences that will help you understand what your applications expose and how to get value out of all this information.
Gianluca Arbezzano, SRE at InfluxData, will share how to monitor a distributed system and how to switch from a more traditional monitoring approach to observability. Stay focused on the server’s role and not on its hostname, which is not really important anymore: our servers and containers are fast-moving parts, and in case of trouble it is easier to detach one than to call each server by name like a cute pet. Learn how to design an SLO for your core services and how to iterate on it. Instrument your services with tracing, using tools like Zipkin or Jaeger, to measure latency between services in your network.
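
One common way to make an SLO concrete is to turn it into an error budget. The sketch below is an illustration only and is not taken from the talk: the function name `ErrorBudget`, the parts-per-million encoding of the target, and the request counts are all assumptions.

```go
package main

import "fmt"

// ErrorBudget returns how many failed requests an availability SLO allows
// over a measurement window, and how much of that budget is still unused.
// targetPPM is the SLO target in parts per million (999000 means 99.9%).
// Integer arithmetic is used to avoid floating-point rounding surprises.
// All names here are hypothetical, chosen for this example.
func ErrorBudget(targetPPM, total, failed int64) (allowed, remaining int64) {
	allowed = total * (1_000_000 - targetPPM) / 1_000_000
	remaining = allowed - failed
	return allowed, remaining
}

func main() {
	// Assumed numbers: a 99.9% target, one million requests, 400 failures.
	allowed, remaining := ErrorBudget(999_000, 1_000_000, 400)
	fmt.Printf("allowed failures: %d, remaining budget: %d\n", allowed, remaining)
}
```

A negative remaining budget would mean the SLO was missed for that window, which is one possible signal for iterating on the target or on the service.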

Published in: Software

  1. 1. @gianarb - Distributed monitoring How to understand the chaos
  2. 2. @gianarb - Who Am I? ● Software engineer passionate about almost everything; at the moment I work with Go ● Open source developer, Docker Captain and CNCF Ambassador ● Site Reliability Engineer at InfluxData ● Speaker, blogger and so on... ● I love to travel, I grow my own vegetables and I cook from time to time
  8. 8. @gianarb - What are you trying to say? Microservices are not “The Distributed System”
  9. 9. @gianarb - What are you trying to say? A queue system can be distributed too...
  10. 10. @gianarb - What are you trying to say? A multi-threaded application is a distributed system
  11. 11. @gianarb - What are you trying to say? The more “distributed” it is across servers, the world, and cloud providers, the more complex it is...
  12. 12. @gianarb - What are you trying to say? Containers, Docker, Kubernetes, and cloud computing accelerated application distribution...
  13. 13. @gianarb - What are you trying to say? Did you migrate to a distributed mess without a real need? Blame yourself...
  14. 14. @gianarb - What are you trying to say? A request rises and falls across multiple applications before it gets back to the user. This is complexity.
  15. 15. @gianarb - Consequences Logs are not easy to follow when they come from distributed applications; they are not a single stream.
  16. 16. @gianarb - Consequences Events and metrics need to be correlated.
  17. 17. @gianarb - Distributed Tracing Tracing is a way to correlate logs using a set of IDs
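
To illustrate the idea of correlating logs through a shared ID, here is a minimal sketch using only the Go standard library. It is not how Zipkin or Jaeger store spans; the `LogLine` type, its fields, and `GroupByTrace` are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// LogLine is a hypothetical structured log record. The only requirement is
// that every service attaches the same TraceID to the lines of one request.
type LogLine struct {
	TraceID  string
	Service  string
	UnixNano int64
	Message  string
}

// GroupByTrace turns interleaved log lines from many services into one
// ordered stream per request, keyed by trace ID.
func GroupByTrace(lines []LogLine) map[string][]LogLine {
	byTrace := make(map[string][]LogLine)
	for _, l := range lines {
		byTrace[l.TraceID] = append(byTrace[l.TraceID], l)
	}
	for _, group := range byTrace {
		// Sorting the slice in place also orders the value stored in the map,
		// because both share the same backing array.
		sort.SliceStable(group, func(i, j int) bool {
			return group[i].UnixNano < group[j].UnixNano
		})
	}
	return byTrace
}

func main() {
	lines := []LogLine{
		{TraceID: "t1", Service: "db", UnixNano: 30, Message: "insert user"},
		{TraceID: "t2", Service: "api", UnixNano: 15, Message: "list users"},
		{TraceID: "t1", Service: "api", UnixNano: 10, Message: "create user request"},
	}
	for _, l := range GroupByTrace(lines)["t1"] {
		fmt.Printf("%s %s: %s\n", l.TraceID, l.Service, l.Message)
	}
}
```

The sketch assumes timestamps from different hosts are comparable, which real tracers do not rely on; they record parent and child span IDs as well.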
  19. 19. @gianarb - Criticalities We write applications in many different languages
  20. 20. @gianarb - Criticalities Across different teams
  21. 21. @gianarb - Criticalities In the end, to build a trace we need to agree on the same protocol, no matter the language or the team.
  22. 22. @gianarb - Distributed Tracing OpenTracing is a standard sponsored by the Cloud Native Computing Foundation (CNCF), developed to agree on common rules. It provides libraries across languages, and you can use many tracers, both open source and as a service.
  23. 23. © 2017 InfluxData. All rights reserved. OpenTracing API (diagram labels): application logic, µ-service frameworks, Lambda functions, RPC & control-flow frameworks, existing instrumentation, tracing infrastructure, main(), Instana, Jaeger, microservice process
  24. 24. >2 years old! Tracer implementations: Zipkin, Jaeger, LightStep, SkyWalking, AWS X-Ray... All sorts of companies use OpenTracing:
  25. 25. Rapidly growing OSS and vendor adoption: JDBI, Java Webservlet
  26. 26. @gianarb - High Cardinality A trace contains a lot of information, indexed by request ID (called trace_id). Traces are expensive to store.
  27. 27. @gianarb - Distributed Tracing Luckily, traces don’t have a long lifecycle. Usually you use them to debug a problem that happened almost in real time or within a short time window.
  28. 28. @gianarb - Distributed Tracing We set a week as the retention policy: after 7 days we downsample and remove the original traces. We also downsample them based on how many requests we receive for a specific API call.
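
Downsampling traces by request volume could look roughly like the sketch below. The talk does not describe the actual policy, so the rule used here (keep every trace up to a threshold, then one in every `rate`) and all names (`Sampler`, `NewSampler`, `Keep`) are assumptions made for illustration.

```go
package main

import "fmt"

// Sampler decides per operation whether a trace is worth storing.
// Rarely called operations keep every trace; busy ones keep one in `rate`.
// This hypothetical type is not safe for concurrent use.
type Sampler struct {
	threshold int            // keep everything up to this many requests
	rate      int            // after the threshold, keep one trace in `rate`
	seen      map[string]int // requests seen so far, per operation name
}

// NewSampler builds a Sampler; a rate below 1 is treated as 1 (keep all).
func NewSampler(threshold, rate int) *Sampler {
	if rate < 1 {
		rate = 1
	}
	return &Sampler{threshold: threshold, rate: rate, seen: make(map[string]int)}
}

// Keep counts the request and reports whether its trace should be stored.
func (s *Sampler) Keep(operation string) bool {
	s.seen[operation]++
	n := s.seen[operation]
	if n <= s.threshold {
		return true
	}
	return (n-s.threshold)%s.rate == 0
}

func main() {
	s := NewSampler(2, 3)
	kept := 0
	for i := 0; i < 8; i++ {
		if s.Keep("api.create_user") {
			kept++
		}
	}
	fmt.Printf("kept %d of 8 traces\n", kept)
}
```

A production version would also reset the counters per time window so that the decision follows the current request rate rather than the lifetime total.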
  29. 29. @gianarb - import opentracing "" import zipkin "" collector, err := zipkin.NewHTTPCollector(tracingConf.ZipkinEndpoint) recorder := zipkin.NewRecorder(collector, false, fmt.Sprintf("", tracingConf.Port), "restapi") tracer, err = zipkin.NewTracer( recorder, zipkin.ClientServerSameSpan(false), zipkin.TraceID128Bit(false), ) opentracing.SetGlobalTracer(tracer)
  30. 30. @gianarb - import opentracing "" tracer := opentracing.GlobalTracer() sp := tracer.StartSpan("api.create_user") defer sp.Finish()
  32. 32. @gianarb - Distributed Tracing
  34. 34. @gianarb - Distributed Tracing - Collect traces via Telegraf
  37. 37. @gianarb - No UI at the moment. :( SELECT * FROM zipkin WHERE time < now() - 1h AND trace_id = 'a4hs45hs46jd56j4s'
  38. 38. @gianarb - The process of understanding 1. Instrument 2. Observe 3. Aggregate and sample 4. Take action (via alerts or whatever)
  39. 39. @gianarb - How do people and teams play this game? They should deploy their own applications.
  40. 40. @gianarb - How do people and teams play this game? They should be on call and take care of the production behavior of their applications.
  41. 41. @gianarb - How do people and teams play this game? They can write a “presentation” of their service (a doc): critical metrics, capacity planning (CPU-, RAM-, or disk-intensive app), and service location in the system (close to other apps, SSD).
  42. 42. @gianarb - How do people and teams play this game? Keep everyone in the loop and responsible for the real traffic. There is no fun in writing code without running it in production!
  43. 43. @gianarb - How do people and teams play this game? Every tool we develop exposes APIs that developers can use, e.g. to create runtime alerts with Kapacitor.
  44. 44. @gianarb - Servers/Containers/VMs are not pets 1. They don’t have names because they come and go based on load and needs. 2. You can’t look at cute server pictures on Instagram. Yet... 3. A server has labels.
  45. 45. @gianarb - Servers/Containers/VMs are not pets Write tools that help you replace servers and processes, or use available projects like AWS Auto Scaling groups, Kubernetes and so on. DevOps is an attitude, not a role that you hire. DevOps people are developers passionate about server automation and related work.
  48. 48. @gianarb - We care about state and events 1. User created 2. Invoice generation 3. Email sent 4. Purchase 5. Whatever...
  49. 49. @gianarb - We care about data! But data is a whole other topic!
  50. 50. @gianarb - Back to tracing - the cost of a retry
  51. 51. @gianarb - Distributed Tracing - the cost of a retry
  52. 52. @gianarb - Wrap up! ● Monitoring a distributed system is hard and you need to correlate all the things ● OpenTracing and distributed tracing ● Keep people in the loop and give them ownership of production ● DevOps is an attitude ● Servers/Containers/Processes are not pets ● Application state and events ● Listen to your system and have fun
  53. 53. @gianarb - Collecting data is just the beginning. Aggregation, alerting, and downsampling are other important steps to answer a question
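
As a rough illustration of the aggregation and downsampling step, the sketch below averages raw points into fixed time windows. It is a standalone example, not the InfluxDB or Kapacitor implementation; `Point` and `Downsample` are hypothetical names, and timestamps are assumed to be non-negative Unix seconds.

```go
package main

import (
	"fmt"
	"sort"
)

// Point is a hypothetical metric sample: a Unix timestamp and a value.
type Point struct {
	Unix  int64
	Value float64
}

// Downsample replaces the raw points in each window (in seconds) with a
// single point holding their mean, stamped with the start of the window.
// A non-positive window returns an unmodified copy of the input.
func Downsample(points []Point, window int64) []Point {
	if window <= 0 {
		return append([]Point(nil), points...)
	}
	sums := make(map[int64]float64)
	counts := make(map[int64]int)
	for _, p := range points {
		start := p.Unix - p.Unix%window // start of the window containing p
		sums[start] += p.Value
		counts[start]++
	}
	out := make([]Point, 0, len(sums))
	for start, sum := range sums {
		out = append(out, Point{Unix: start, Value: sum / float64(counts[start])})
	}
	// Map iteration order is random, so sort the result by time.
	sort.Slice(out, func(i, j int) bool { return out[i].Unix < out[j].Unix })
	return out
}

func main() {
	raw := []Point{{0, 1}, {30, 3}, {60, 10}, {90, 20}}
	for _, p := range Downsample(raw, 60) {
		fmt.Printf("t=%d mean=%.1f\n", p.Unix, p.Value)
	}
}
```

The mean is only one possible aggregate; for latency data a maximum or a percentile per window often answers the question better.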