Increasing visibility of distributed systems in production

Talk from October 20th 2017 at Velocity London.

Understanding the running state of an application is key to efficiently troubleshooting production issues and, ultimately, anticipating outages.

When systems grow larger and become distributed, the visibility of application health needs to become a first-class concern. As the likelihood of something going wrong increases, the focus shifts from increasing Mean Time Between Failures (MTBF) to reducing Mean Time To Recovery (MTTR). The best way to achieve this consistently is to build monitoring in as an integral part of product development, instead of treating it as an afterthought.

Monitoring can start simple, with basic telemetry such as Healthchecks, which increase visibility into the system's status. Exposing more advanced Metrics can then give more detail on how the system is working at the system level (e.g. resource usage), the application level (e.g. response times) and the business level (e.g. completed sales). Later on, these Healthchecks and Metrics can be used to trigger alerts when observed values fall outside expected thresholds.
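As a rough illustration of this progression, the sketch below runs a set of named Healthchecks and then flags Metrics whose observed values fall outside expected thresholds. It is not taken from the talk: the `Threshold` class, the `run_healthchecks` and `alerts` functions, and all check names, metric names and values are hypothetical and chosen only to mirror the three levels mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Threshold:
    """Expected range for one metric (hypothetical structure)."""
    low: float
    high: float

    def breached(self, value: float) -> bool:
        return not (self.low <= value <= self.high)


def run_healthchecks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, object]:
    """Run every named check; a check that raises counts as failing."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = "UP" if all(results.values()) else "DOWN"
    return {"status": status, "checks": results}


def alerts(metrics: Dict[str, float], thresholds: Dict[str, Threshold]) -> List[str]:
    """Return the names of metrics whose observed value is outside its threshold."""
    return sorted(
        name
        for name, threshold in thresholds.items()
        if name in metrics and threshold.breached(metrics[name])
    )


# Illustrative values only: one metric per level (system, application, business).
checks = {
    "database_reachable": lambda: True,
    "queue_has_capacity": lambda: False,
}
metrics = {
    "cpu_usage_percent": 42.0,
    "response_time_ms": 850.0,
    "completed_sales_per_min": 12.0,
}
thresholds = {
    "cpu_usage_percent": Threshold(0, 90),
    "response_time_ms": Threshold(0, 500),
    "completed_sales_per_min": Threshold(5, float("inf")),
}

report = run_healthchecks(checks)
firing = alerts(metrics, thresholds)
print(report["status"], firing)
```

In a real deployment the Healthcheck report would typically be exposed over HTTP and the threshold evaluation delegated to a monitoring system rather than done in application code; the sketch only shows the shape of the data involved.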

This talk aims to provide an overview of different monitoring patterns and of how different tools can be used to help build a fuller picture of a running application.

Published in: Software

  1. Increasing visibility of distributed systems in production. Oct 20th, 2017 – Velocity, London. @PierreVincent
  2. Pierre Vincent, SRE Manager at Poppulo
  3. Visibility: Relative ability of being perceptible to the eye
  4. Reaching production is only the beginning
  5. No system is immune to failure. Be ready to recover.
  6. When distributing a system, we’re also distributing the places where things might go wrong
  7. Healthchecks
  8. Healthchecks: Is it running? Can it perform its task? Can it accept more work?
  9. Healthchecks: Broadcast; Register; Expose
  10. Overzealous Healthchecks can be counter-productive. Read more about it! Source: HTTP Healthchecks for a Resilient Platform - Chris O’Dell (healthchecks-for-a-resilient-platform)
  11. Metrics
  12. Metrics: System metrics (CPU usage); Application metrics (Error rates); Business metrics (Customer conversions)
  13. Metrics: Servers / VMs; Appliances / Infra; Services; Metrics collector; Metrics query engine; Dashboards; Alerts
  14. Metrics: Servers / VMs (/metrics); Appliances / Infra (/metrics); Services (/metrics); Prometheus
  15. Metrics: Usability of metrics tooling is key to adoption. Instrument code; Query metrics; Create dashboards; Alert on thresholds
  16. Limit alerting to user-impacting symptoms. Expose dashboards to diagnose causes.
  17. Logging & Tracing
  18. Logging: Making sense of (a lot of) logs. Common searchable format; Correlation IDs; Centralise logs
  19. Tracing (diagram of services A, B, C, D, E, F, G, H, J; trace ID a1b2c3):
      ERROR [svc=H][trace=a1b2c3] Failed to save order. Cause: Cassandra timeout exception
      ERROR [svc=F][trace=a1b2c3] Failed to complete order. Cause: Shipping service responded with 500
      ERROR [svc=A][trace=a1b2c3] Failed to process order. Cause: Order process manager responded with 500
      INFO [svc=G][trace=a1b2c3] Items verified in stock
  20. (no text)
  21. Visibility enables operability
  22. Visibility helps justify decisions
  23. Visibility builds trust but requires safety
  24. “If you can’t monitor a service, you don’t know what’s happening, and if you’re blind to what’s happening, you can’t be reliable.” B. Beyer, C. Jones, N. Murphy, J. Petoff, Site Reliability Engineering
  25. Thank you! Pierre Vincent, SRE Manager at Poppulo
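
The log lines on the tracing slide (slide 19) share a `[svc=...][trace=...]` prefix, which is what makes them searchable by correlation ID once centralised. The following is a minimal sketch of how such lines could be produced and filtered with Python's standard `logging` module. It is an assumption about one possible implementation, not the tooling used in the talk: `make_logger`, `lines_for_trace` and the second trace ID `d4e5f6` are hypothetical, and an in-memory stream stands in for a central log store.

```python
import io
import logging
from typing import List


def make_logger(service: str, stream) -> logging.Logger:
    """Build a logger whose lines follow the slide's format (hypothetical helper)."""
    logger = logging.getLogger(f"svc.{service}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    # %(trace)s is filled from the `extra` dict passed on each log call.
    handler.setFormatter(
        logging.Formatter(f"%(levelname)s [svc={service}][trace=%(trace)s] %(message)s")
    )
    logger.handlers = [handler]
    return logger


def lines_for_trace(text: str, trace_id: str) -> List[str]:
    """Return every log line carrying the given correlation ID."""
    return [line for line in text.splitlines() if f"[trace={trace_id}]" in line]


# One shared stream stands in for a centralised log store.
central = io.StringIO()
log_h = make_logger("H", central)
log_f = make_logger("F", central)
log_g = make_logger("G", central)

# The same correlation ID travels with the request through services H and F.
log_h.error("Failed to save order", extra={"trace": "a1b2c3"})
log_f.error("Failed to complete order", extra={"trace": "a1b2c3"})
# An unrelated request (illustrative trace ID) is filtered out below.
log_g.info("Items verified in stock", extra={"trace": "d4e5f6"})

matching = lines_for_trace(central.getvalue(), "a1b2c3")
for line in matching:
    print(line)
```

In practice the correlation ID would be generated at the edge of the system and passed between services (for example in a request header), so that every service logs it without generating its own.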