Monitoring microservices

Microservices are a great way to design your system so that it can scale. But once those pieces are in production, how do you know whether they are all working properly? Are some metrics more important than others, and what story can each metric tell you? This talk shows you some tools and techniques for monitoring distributed systems.

Published in: Engineering

  1. 1. Techniques for monitoring Microservices William Brander @williambza Particular Software
  2. 2. An average production system Database • Is the web server up? • Is the database up? • Can the web server talk to the db?
  3. 3. What are you actually monitoring? Monitoring Area • Infrastructure: Are my servers running? • Application: Is my application process running? • Business Capability: Can users place an order?
  4. 4. Monitoring Concerns (each area lists Health, Performance, Capacity in that order) • Infrastructure: Is the server up? Is there high CPU? Do I have enough disk space? • Application: Is my application generating exceptions? How quickly is my system processing messages? Can I handle month-end batch jobs? • Business Capability: Can users access the checkout cart? Are we meeting our SLAs? What is the impact of adding another customer?
  5. 5. Interaction Type • Passive: The monitoring system can display metrics • Reactive: The monitoring system alerts me when something happens • Proactive: The monitoring system automatically takes action to repair the system
  6. 6. A Monitoring Philosophy • Monitoring Area: Business Capability, Application, Infrastructure • Monitoring Concern: Capacity, Performance, Health • Interaction Type: Proactive, Reactive, Passive
  7. 7. Recap: What are we monitoring? Database • Is the web server up? • Is the database up? • Can the web server talk to the db? Classification: Infrastructure, Health, Passive
  8. 8. [Figure: a series of increasing values, from 27 to 45]
  9. 9. Recap: What are we monitoring? • Warn me when the queue length exceeds 50 Classification: Infrastructure, Performance, Reactive
  10. 10. A Monitoring Philosophy • Monitoring Area: Business Capability, Application, Infrastructure • Monitoring Concern: Capacity, Performance, Health • Interaction Type: Proactive, Reactive, Passive
  11. 11. What happens when we distribute the systems?
  12. 12. Going Distributed Email • PDF • CRM
  13. 13. Let’s look at queue length
  14. 14. Queue Length • Queue length is an indicator of work still outstanding • A high queue length doesn’t necessarily indicate a problem, though • Stable or decreasing is good • Increasing is bad
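
The point that the trend matters more than the absolute queue length can be sketched in a few lines. This is only an illustration, not the speaker's implementation: the function name `queue_trend`, the half-versus-half comparison, and the `tolerance` parameter are all assumptions made for the example.

```python
from typing import Sequence


def queue_trend(samples: Sequence[int], tolerance: float = 0.0) -> str:
    """Classify queue-length samples (oldest first) by their direction.

    Hypothetical helper: compares the average of the newer half of the
    samples with the average of the older half.
    """
    if len(samples) < 2:
        return "stable"  # not enough data to see a direction
    mid = len(samples) // 2
    older = sum(samples[:mid]) / mid
    newer = sum(samples[mid:]) / (len(samples) - mid)
    delta = newer - older
    if delta > tolerance:
        return "increasing"  # work is arriving faster than it is processed
    if delta < -tolerance:
        return "decreasing"  # the endpoint is catching up
    return "stable"
```

Under these assumptions a queue that sits at 500 messages is reported as stable, while one that creeps from 27 to 32 is reported as increasing, which is the case worth alerting on.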
  15. 15. Infrastructure Performance
  16. 16. Processing Time ⏱️ ⌛✔
  17. 17. Processing Time • Processing Time is the time taken to successfully process a message • Processing Time does not include error handling time • It is independent of queue wait time • Stable or decreasing could be good • Increasing is bad
  18. 18. Application Performance
  19. 19. Critical Time ⏱️ ⌛ ✔ Critical time = the entire time taken to process a message successfully
  20. 20. Critical Time • Critical Time is the total duration from when a message is created to when it is processed • Critical Time = Time in Queue + Processing Time + Retry Time + Network Latency Time • Stable or decreasing could be good • Increasing is bad
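
As a rough sketch of how the two timings relate, both can be derived from three timestamps recorded for a message. The class and field names below are hypothetical, and the sketch assumes the sender's and receiver's clocks are synchronized closely enough to subtract their timestamps.

```python
from dataclasses import dataclass


@dataclass
class MessageTiming:
    """Hypothetical timing record for one successfully processed message."""

    created_at: float             # seconds; when the message was created
    processing_started_at: float  # start of the attempt that succeeded
    processing_ended_at: float    # end of the attempt that succeeded

    @property
    def processing_time(self) -> float:
        # Only the successful attempt counts; queue wait is excluded.
        return self.processing_ended_at - self.processing_started_at

    @property
    def critical_time(self) -> float:
        # Creation to completion: queue wait, retries, network latency,
        # and processing are all included.
        return self.processing_ended_at - self.created_at

    @property
    def overhead(self) -> float:
        # Time in queue + retry time + network latency, taken together.
        return self.critical_time - self.processing_time
```

A message created at t=100.0 whose successful attempt runs from 104.5 to 105.0 has a processing time of 0.5 seconds but a critical time of 5.0 seconds; the 4.5-second difference is what the endpoint cannot see by looking at processing time alone.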
  21. 21. Putting these together • Each of these metrics presents a piece of the puzzle • Look at them from an endpoint’s perspective, not per message • Looking at them together gives great insight into your system [Charts: Critical Time, Processing Time, Queue Length]
  22. 22. Detecting Connectivity • Distributed systems typically keep working when other parts aren’t available • How do you know the endpoint you’re sending messages to is actually processing messages?
  23. 23. Detecting Connectivity • Peer-to-peer connectivity tells us whether an endpoint is actually processing messages from another endpoint
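
One way to picture peer-to-peer connectivity is a table of when each receiver last processed a message from each sender. The sketch below is an assumption about how such a check could work, not the mechanism described in the talk; `ConnectivityTracker`, its methods, and the endpoint names in the usage note are all made up for illustration.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]  # (sender endpoint, receiver endpoint)


class ConnectivityTracker:
    """Hypothetical tracker: remembers when each receiver last processed
    a message from each sender and reports links that have gone quiet."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_processed: Dict[Link, float] = {}

    def record_processed(self, sender: str, receiver: str, now: float) -> None:
        # Called whenever `receiver` finishes a message that `sender` sent.
        self._last_processed[(sender, receiver)] = now

    def stale_links(self, now: float) -> List[Link]:
        # Links with no processed message within the timeout window.
        return sorted(
            link
            for link, seen in self._last_processed.items()
            if now - seen > self.timeout_seconds
        )
```

With a 30-second timeout, if "Billing" last processed a message from "Sales" at t=0 and "Shipping" did so at t=25, then at t=40 only the Sales-to-Billing link is reported. In a system with little natural traffic, the sender would also need to emit periodic heartbeat messages for silence to mean anything.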
  24. 24. How do we collect all this info? ⏱️ Metrics: • Processing Time • Critical Time • Queue Length • Connectivity. Each measurement carries: • Reporting Metric (N bytes) • Message Type (N bytes) • Timestamp (8 bytes) • Value (8 bytes)
  25. 25. How do we collect all this info? • Epoch time (8 bytes) • Dictionary of Metric Types (n * (N + 4) bytes) • Dictionary of Message Types (n * (N + 4) bytes) • An array of entries, each with: Reporting Metric index (4 bytes), Message Type index (4 bytes), Epoch offset (4 bytes), Value (8 bytes)
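
The batched layout above can be sketched with Python's `struct` module. The slide gives only field sizes, so several details here are assumptions: little-endian byte order, millisecond timestamps, signed 64-bit integer values, a 4-byte length prefix per dictionary name, and a 4-byte count in front of each dictionary and of the entry array. The function names are hypothetical.

```python
import struct
from typing import List, Tuple

# (reporting metric, message type, timestamp in ms, value)
Entry = Tuple[str, str, int, int]


def _pack_names(names: List[str]) -> bytes:
    # Assumed dictionary encoding: 4-byte count, then a 4-byte length
    # followed by the UTF-8 bytes for each name (N + 4 bytes per name).
    out = struct.pack("<I", len(names))
    for name in names:
        raw = name.encode("utf-8")
        out += struct.pack("<I", len(raw)) + raw
    return out


def pack_batch(entries: List[Entry]) -> bytes:
    """Pack measurements so each name is written once and each entry
    refers to it by a 4-byte index."""
    if not entries:
        raise ValueError("nothing to pack")
    epoch = min(ts for _, _, ts, _ in entries)
    metrics = sorted({m for m, _, _, _ in entries})
    types = sorted({t for _, t, _, _ in entries})
    metric_index = {name: i for i, name in enumerate(metrics)}
    type_index = {name: i for i, name in enumerate(types)}

    out = struct.pack("<q", epoch)          # epoch time, 8 bytes
    out += _pack_names(metrics)             # dictionary of metric types
    out += _pack_names(types)               # dictionary of message types
    out += struct.pack("<I", len(entries))  # assumed entry count
    for metric, msg_type, ts, value in entries:
        # metric index (4), message type index (4), epoch offset (4), value (8)
        out += struct.pack(
            "<IIIq", metric_index[metric], type_index[msg_type], ts - epoch, value
        )
    return out
```

Each entry costs a fixed 20 bytes regardless of how long the metric and message type names are, so the saving over repeating the names in every measurement grows with the number of entries per batch.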
  26. 26. Getting all the data
  27. 27. Getting all the data
  28. 28. Getting all the data
  29. 29. Techniques for monitoring Microservices William Brander @williambza Particular Software
