
Monitoring Is Never Done



Our monitoring team works in a cycle of four phases: Definition, Collection, Visualization, and Action. We've found it effective to be clear about which phase we are in, both to communicate our needs and to report our progress. This talk was presented as a lightning talk at Monitorama 2015 by Melanie Cey.



  1. Monitoring is Never “Done” @melaniemj
  2. Responsibilities @ Yardi: implementation and administration of monitoring, alerting, and log aggregation/analysis tools.
       o 15,000+ devices
       o 9 datacenters
       o 5,000+ customer installations
       o We monitor Windows envs with Linux envs
  3. This was me in 2008 @ Point2
  4. How code is delivered
  5. How code operates in production
  6. A good problem to have: everyone wants “the monitoring” so they can say “it’s monitored”
  7. Communicating Work
       o Classify
       o Quantify
       o Qualify
  8. Words...
       o Logging
       o Alerting
       o Dashboards
       o Reports
       o 4-9s
       o 24x7x365 this shit can’t go down
  9. Can it be this simple?
       o Let’s talk about “the monitoring” for X
       o Be awesome
       o X is monitored
  10. DCVA (OODA)
  11. Phase 1: Definition. “I can hit this one page so it’s up, right?” No thanks, let’s redefine status.
  12. Phase 1: Definition
       o What questions are you trying to answer?
       o What information do you need when a failure occurs?
       o What are the most common failures?
       o Who is the audience for the information?
  13. Phase 2: Checks & Collections
       o Environment & Code
       o Data points
       o Detailed logs
       o Current state
  14. Phase 3: Visualization
       o Analysis
       o Dashboards
       o Correlations
  15. Phase 4: Action
       o Fault detection
       o Alerting
       o RCA
  16. Cycle
       o (What to collect)
       o (Inform on failure)
       o (How to collect)
       o (Make collections pretty)
  17. Team Time Distribution
  18. Time Distribution (Desired)
  19. Is “X” monitored? When “X” goes into some degraded state:
       o The right people know.
       o They have enough information to find the problem, recover, and later to do RCA.
       o If they don’t, they will revisit definition.
  20. How does your team:
       o Classify
       o Quantify
       o Qualify
  21. Monitoring is Never “Done”
       Melanie Cey, @melaniemj
       Senior Systems Analyst, Systems Reliability Engineering @ Yardi
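
The four-phase cycle and the "Is X monitored?" test from the slides can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the names `Phase`, `IncidentReview`, `is_monitored`, `next_phase`, and `phase_after_incident` are all hypothetical, and the three boolean fields are one possible reading of the checklist on the "Is X monitored?" slide.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    """The four phases named in the talk, in cycle order."""
    DEFINITION = 1
    COLLECTION = 2
    VISUALIZATION = 3
    ACTION = 4


@dataclass
class IncidentReview:
    """Hypothetical record of what happened when "X" degraded."""
    right_people_knew: bool        # the right people were informed
    enough_info_to_recover: bool   # they could find the problem and recover
    enough_info_for_rca: bool      # they could later do root cause analysis


def is_monitored(review: IncidentReview) -> bool:
    """"X" counts as monitored only if every checklist item held."""
    return (
        review.right_people_knew
        and review.enough_info_to_recover
        and review.enough_info_for_rca
    )


def next_phase(current: Phase) -> Phase:
    """Advance around the cycle; Action wraps back to Definition."""
    return Phase(current.value % len(Phase) + 1)


def phase_after_incident(current: Phase, review: IncidentReview) -> Phase:
    """If a degradation exposed a gap, revisit Definition; otherwise keep cycling."""
    if not is_monitored(review):
        return Phase.DEFINITION
    return next_phase(current)
```

Because `next_phase` always wraps from Action back to Definition, the sketch has no terminal state, which is one way to read the title: the work is a loop, and a failed checklist item simply sends the team back to Definition sooner.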