StatsCraft 2015: Top down approach to monitoring - Shahar Kedar

Slides of Shahar Kedar's talk at StatsCraft 2015 event

Published in: Technology

  1. 1. Top-Down Approach to Monitoring (July 30, 2015)
  2. 2. 1996: Tivoli Software acquired by IBM; Patrol Software acquired by BMC; Ethan Galstad creates a simple MS-DOS application designed to "ping" Novell NetWare servers. “HOW to monitor?” is the primary question.
  3. 3. 2015: https://www.bigpanda.io/monitoringscape/
  4. 4. Shifting from “How?” to “What?”
  5. 5.
  6. 6. Bottom-Up Approach: Network, Servers, Apps, Overall System Health
  7. 7. Problem #1: Inflation of Tools
  8. 8. Problem #2: Inflation of “Whats”
  9. 9. Problem #3: Inflation of Alerts
  10. 10.
  11. 11. We’re trying to answer a simple question: Is our system in a healthy state?
  12. 12. No Alerts ≠ Healthy System; Many Alerts ≠ Unhealthy System
  13. 13. Healthy System = A system that continuously generates value for its users under a well-known set of KPIs
  14. 14. Top-Down Approach: KPIs, UX, Overall System Health
  15. 15. Top-Down vs Bottom-Up
     • Top-Down (KPIs, UX, Overall System Health): Selective, Proactive
     • Bottom-Up (Network, Servers, Apps, Overall System Health): Exhaustive, Reactive
  16. 16. Definition of KPI: A key performance indicator (KPI) is a business metric used to evaluate factors that are crucial to the success of an organization. KPIs differ per organization.
  17. 17. Let’s play a game! This is Sam. What does Sam’s company do? CPU Utilization / # Clicks on a button / Temperature
  18. 18. The Pulse of Netflix. We sought out a single indicator that closely approximated our most important activity: viewing. We discovered that a server-side metric related to playback starts (the act of “clicking play”) had both a predictable pattern and fluctuated significantly when UI/device/server problems were happening. The Netflix streaming pulse was created. We named it “SPS” for “starts per second”. http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
  19. 19. Healthy SPS Pattern http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
  20. 20. Unhealthy SPS Pattern http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
  21. 21. What’s so special about SPS?
     • SPS is easy for all stakeholders to understand
     • One metric that covers different points of failure: server problems, device problems, etc.
     • Most important: it’s a clear KPI that indicates when user experience is compromised
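
As a rough illustration of the SPS idea on the slides above, the following Python sketch counts playback-start events per second and flags a large deviation from an expected value. The function names, the 30% tolerance, and the idea of a precomputed expected value are assumptions made for this example; the slides do not describe how Netflix actually detects an unhealthy pattern.

```python
from collections import Counter


def starts_per_second(event_timestamps):
    """Count playback-start events per whole second (timestamps are in seconds)."""
    return Counter(int(ts) for ts in event_timestamps)


def is_unhealthy(observed_sps, expected_sps, tolerance=0.3):
    """Return True when observed SPS deviates from the expected SPS
    by more than `tolerance` (a hypothetical fraction, here 30%)."""
    if expected_sps <= 0:
        # No starts were expected; any observed start counts as a deviation.
        return observed_sps != 0
    return abs(observed_sps - expected_sps) / expected_sps > tolerance


if __name__ == "__main__":
    counts = starts_per_second([10.1, 10.9, 11.5])
    for second, sps in sorted(counts.items()):
        print(second, sps, is_unhealthy(sps, expected_sps=2))
```

The check is deliberately symmetric: both a drop and a spike relative to the expected value count as a deviation, since either breaks the predictable pattern the slides refer to.
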
  22. 22. But what about root cause analysis? (KPIs, UX, Overall System Health; Network, Servers, Apps)
  23. 23. GitHub: need for speed. The most important factor in web application design is responsiveness. And the first step toward responsiveness is speed. But speed within a web application is complicated. https://github.com/blog/1252-how-we-keep-github-fast
  24. 24. Start from the Top: Response Times Dashboard https://github.com/blog/1252-how-we-keep-github-fast
     • Each row represents a different major component
     • Clicking one of the rows lets you dive in and see the mean, 98th percentile, and 99.9th percentile response times
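
The dashboard statistics named on slide 24 (mean, 98th percentile, 99.9th percentile) can be sketched in Python as below. This is a minimal nearest-rank calculation under assumed names (`percentile`, `summarize`); it is not GitHub's implementation, which the slides do not show.

```python
import math
from statistics import mean


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    # Clamp to 1 so that very small percentiles still map to the first element.
    return sorted_values[max(rank, 1) - 1]


def summarize(response_times_ms):
    """Return the mean, 98th and 99.9th percentile of response times."""
    ordered = sorted(response_times_ms)
    return {
        "mean": mean(ordered),
        "p98": percentile(ordered, 98),
        "p99.9": percentile(ordered, 99.9),
    }


if __name__ == "__main__":
    # Hypothetical sample: mostly fast requests plus one slow outlier.
    sample = [20] * 99 + [2000]
    print(summarize(sample))
```

Tracking high percentiles alongside the mean helps because a few very slow requests barely move the mean but show up clearly at the 99.9th percentile.
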
  25. 25. Digging Deeper: Mission Control Bar https://github.com/blog/1252-how-we-keep-github-fast Total Time, Render Time, Cache & Database, JS & CSS Size
  26. 26. And Deeper https://github.com/blog/1252-how-we-keep-github-fast Render Breakdown, SQL Query Viewer
  27. 27. Why talk about BigPanda? Because Pandas are awesome!
  28. 28. BigPanda. Because...
     • We’re not Netflix or GitHub: growing startup (7 devs, 1 full-time Ops)
     • We feel the pain!
     • Our KPIs are easy to describe and understand (especially if you’re an Ops person)
  29. 29. BigPanda: As a unified dashboard on top of all your monitoring systems, and eventually a single point of truth for production incidents, our data pipeline has to be reliable and fast. KPI: Low data pipeline latency
  30. 30. Pipeline Latency Metric
     • Metrics are sent from within the apps
     • Stored in Graphite
     • Sum of all the average latencies of all alerts that went through the pipeline
     • Monitored by Nagios
  31. 31. Pipeline Latency Metric
     • Very good indicator of possible service outage
     • Must-have for detection of SLA violation
     • Very good indicator of performance bottlenecks (can be broken down to sub-pipelines / specific organizations etc.)
     • Simple and high-level: easy to explain to non-technical stakeholders (e.g. sales)
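
One way the setup on slides 30 and 31 might look in Python: a sample is formatted in Graphite's plaintext protocol (`path value timestamp`), sent to Carbon over TCP, and a latency value is mapped to the standard Nagios plugin exit codes. The metric path `pipeline.latency.ms`, the host, and the thresholds are hypothetical; the slides do not give BigPanda's actual names or limits.

```python
import socket
import time


def graphite_line(path, value, timestamp):
    """Format one sample in Graphite's plaintext protocol:
    metric path, value, and Unix timestamp, followed by a newline."""
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(line, host="localhost", port=2003):
    """Send one plaintext sample over TCP (2003 is Carbon's default plaintext port)."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))


def nagios_status(latency_ms, warn_ms=5000, crit_ms=15000):
    """Map a latency to Nagios plugin exit codes: 0 OK, 1 WARNING, 2 CRITICAL.
    The thresholds are placeholders, not values from the talk."""
    if latency_ms >= crit_ms:
        return 2
    if latency_ms >= warn_ms:
        return 1
    return 0


if __name__ == "__main__":
    # Hypothetical measurement taken inside the app.
    line = graphite_line("pipeline.latency.ms", 120, time.time())
    print(line, end="")
    print(nagios_status(120))
    # send_to_graphite(line)  # Uncomment when a Carbon listener is reachable.
```

Keeping the formatting and threshold logic separate from the network call makes both easy to check without a running Graphite or Nagios instance.
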
  32. 32. TL;DR
     • Bottom-up approach (“monitor all the things”) is easier to start with, but soon enough leads to alert fatigue and disorientation.
     • Top-down approach requires thought and custom instrumentation, but keeps you focused on what’s important.
     • High-level metrics can be complemented by low-level metrics. Trying to deduce the former from the latter is futile.
     • Take advantage of the rich monitoring landscape, but as a means to an end. Don’t let the tools dictate what you need to measure.
     • Monitoring is, first of all, about your business.
  33. 33. Questions?
  34. 34. Thanks!
