Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

QCon London - How to build observable distributed systems

405 views

Published on

Being able to observe the state of a running application is key to understanding a system's behaviour and essential if you want to fix production problems quickly and efficiently. Like a lot of other things, this is harder to do in distributed systems than it is with a monolith. At Poppulo we've been running a distributed system of hundreds of microservices in production for more than 4 years and we got to understand how critical this visibility is. If you want to succeed with operating a distributed system, observability should be an integral part of system design. I'll cover key techniques to build a clearer picture of distributed applications in production, including details on useful health checks, best practices for instrumentation with metrics, logging and tracing.

Published in: Software
  • Be the first to comment

QCon London - How to build observable distributed systems

  1. 1. @PierreVincent How to build observable distributed systems March 8th, 2018 – QCon London @PierreVincent pvincent.io
  2. 2. @PierreVincent
  3. 3. @PierreVincent Pierre Vincent SRE Manager at Poppulo @PierreVincent pvincent.io
  4. 4. @PierreVincent Reaching production is only the beginning
  5. 5. @PierreVincent No system is immune to failure Be ready to recover
  6. 6. @PierreVincent When distributing a system, we’re also distributing the places where things might go wrong
  7. 7. @PierreVincent Monitoring only applies to known failure modes. What about everything else?
  8. 8. @PierreVincent Monitoring tells you whether the system works. Observability lets you ask why it's not working. “ ”– Baron Schwarz
  9. 9. @PierreVincent Healthchecks Metrics Logging Tracing
  10. 10. @PierreVincent Healthchecks
  11. 11. @PierreVincent Is it running? Can it perform its task? Can it accept more work? Healthchecks
  12. 12. @PierreVincent Broadcast Register Expose Healthchecks
  13. 13. @PierreVincent Healthchecks /health { "service": "registration-service", "healthy": true, "workload": { "healthy": true }, "dependencies": [ { "name": "cassandra", "healthy": true }, { "name": "billing-svc", "healthy": true }, ] } GET http://1.2.3.4:8080/health 200 OK
  14. 14. @PierreVincent Source: HTTP Healthchecks for a Resilient Platform - Chris O’Dell skeltonthatcher.com/blog/http-healthchecks-for-a-resilient-platform Overzealous Healthchecks can be counter-productive
  15. 15. @PierreVincent Metrics
  16. 16. @PierreVincent System metrics Application metrics Business metrics CPU usage Error rates Customer conversions Metrics
  17. 17. @PierreVincent Servers / VMs Appliances/Infra Services Metrics collector Metrics query engine Dashboards Alerts Metrics
  18. 18. @PierreVincent Servers / VMs Appliances/Infra Services /metrics /metrics /metrics Prometheus Metrics
  19. 19. @PierreVincent Watch out for over-reliance on metrics Limitations at high-cardinality Not every metric deserves an alert Real-time querying means some trade-offs on retention Limit alerting to user-impacting symptoms Poor fine-grained debugging e.g. CustomerId Not suitable for long-term trend analysis
  20. 20. @PierreVincent Logging
  21. 21. @PierreVincent Searchable Correlated Logging Making sense of (a lot of) logs Centralised
  22. 22. @PierreVincent A F H D J B E C G a1b2c3 a1b2c3 a1b2c3 ERROR [svc=H][trace=a1b2c3] Failed to save order Cause: Cassandra timeout exception ERROR [svc=F][trace=a1b2c3] Failed to complete order Cause: Shipping service responded with 500 ERROR [svc=A][trace=a1b2c3] Failed to process order Cause: Order process manager responded with 500 a1b2c3 INFO [svc=G][trace=a1b2c3] Items verified in stock Log Correlation
  23. 23. @PierreVincent JSON 2018-02-20T16:38:23+00:00 ERROR Read timed out timestamp 2018-02-20T16:38:23+00:00 requestID ec667cb45 level ERROR team eventsservice registration-service commit 542a8b8e build 542a8b8e.7 node node_e79f3e52 log Read timed out region europe-west2 runtime java-1.8.0_161 When did it happen? Can I trace it? What is it? What is it running? Where is it? What is the message? customerID 55123Who caused it? userID 458 ... ...Any other info? Hmmm thanks… ?
  24. 24. @PierreVincent Structured logs unleash high-cardinality exploration Error rate spike isolated by build version Activity spike isolated for single customer
  25. 25. @PierreVincent Tracing
  26. 26. @PierreVincent Tracing get_confirmed_attendees get_attendees get_confirm_status cassandra/select mysql/select event-mgt-api attendees-service registration-service Trace Spans 0ms 50ms 100ms 150ms 172ms 172ms 73ms 54ms 78ms 41ms
  27. 27. @PierreVincent
  28. 28. @PierreVincent Unknown Unknowns Known Unknowns Healthchecks Metrics (Alerting) Logs Metrics (Queries) Tracing Events Monitoring & Resiliency Debugging & Exploration
  29. 29. @PierreVincent Cheap to instrument Easy to explore Reliable & Trustworthy Usability of tooling is key to adoption
  30. 30. @PierreVincent Visibility builds trust but requires safety
  31. 31. @PierreVincent Visibility helps justify decisions
  32. 32. @PierreVincent Visibility enables operability
  33. 33. @PierreVincent Test (a little bit*) lessTest (a little bit*) lessDon’t spend all your time testing ...keep some for instrumenting Thank you! @PierreVincent pvincent.io
  34. 34. @PierreVincent Thank you! Pierre Vincent SRE Manager at Poppulo @PierreVincent pvincent.io

×