Atlassian HostedOps Ondemand Monitoring

This presentation covers parts of the previous, current and future state of our monitoring strategy. So don't implement it all :D

Published in: Technology, Design

    1. Monitoring & Metrics (HOSTEDOPS). Patrick Debois - http://jedi.be/blog
    2. Recently started at Product Service
    3. Availability = Uptime / (Uptime + Downtime)
    4. Availability vs Downtime (unplanned)

         Availability          Downtime
         90% (one 9)           36.5 days/year
         99% (two 9s)          3.65 days/year
         99.9% (three 9s)      8.76 hours/year
         99.99% (four 9s)      52 minutes/year
         99.999% (five 9s)     5 minutes/year
         99.9999% (six 9s)     31 seconds/year
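
The downtime figures in slide 4 follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year (the function name is illustrative, not from the deck):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year, no leap days


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the unplanned downtime per year, in seconds, for a given availability."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


# Print the downtime for each "number of nines" listed on the slide.
for percent in (90, 99, 99.9, 99.99, 99.999, 99.9999):
    seconds = downtime_seconds_per_year(percent)
    print(f"{percent}% -> {seconds / 3600:.4f} hours/year")
```

Dividing the result by 86400, 3600 or 60 gives days, hours or minutes per year, which is how the slide's mixed units come about.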
    5. Decomposing Availability. Timeline: Correct Behavior (MTTF) -> First Failure -> Diagnose (MTTD) -> Begin Repair -> Repair (MTTR) -> End Repair -> Correct Behavior (MTTF) -> Second Failure. MTBF spans from the first failure to the second failure. MTTF : Mean Time To Failure. MTTD : Mean Time To Diagnose. MTTR : Mean Time To Repair. MTBF : Mean Time Between Failure.
    6. Availability = MTTF / MTBF
    7. Availability = MTTF / (MTTF + MTTD + MTTR). MTTF : Mean Time To Failure. MTTD : Mean Time To Diagnose. MTTR : Mean Time To Repair. MTBF : Mean Time Between Failure.
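
To see why the deck keeps tagging items with MTTD-- and MTTR--, the formula from slide 7 can be evaluated with a few numbers. This is a sketch only; the hour values below are made up for illustration and do not come from the presentation.

```python
def availability(mttf: float, mttd: float, mttr: float) -> float:
    """Availability = MTTF / (MTTF + MTTD + MTTR); all inputs in the same time unit."""
    return mttf / (mttf + mttd + mttr)


# Hypothetical values in hours: one failure per 30 days,
# 30 minutes to diagnose, 90 minutes to repair.
baseline = availability(mttf=720, mttd=0.5, mttr=1.5)

# Same system, but better monitoring halves the time to diagnose.
faster_diagnosis = availability(mttf=720, mttd=0.25, mttr=1.5)

print(f"baseline:         {baseline:.5f}")
print(f"faster diagnosis: {faster_diagnosis:.5f}")
```

With MTTF held constant, any reduction in MTTD or MTTR shrinks the denominator and raises availability.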
    8. Mean Time To Diagnose: ~ Mean Time to Detect ~ Mean Time to Notification ~ Mean Time to Respond ~ Time to understand problem
    9. Mean Time To Repair: ~ Time to restart/reset a service ~ Time to diagnose problem ~ Time to find information (~ Mean Time between Blame)
    10. Mean Time To Failure: ~ Detect component fatigue ~ Check trends in capacity & usage ~ Design and component selection ~ Monitoring service abuses ~ Security Checks ~ Auditing/Verification Checks ~ Testing changes
    11. Challenge: SCALE. 20,000 JVMs in 60 days
    12. Sharing/Self Service: • Radiate information • Think ‘opendata’/‘openapi’ • Reduce middleman effort
    13. Metrics Collection: “measurement of a particular characteristic”. Gmond, Collectd, Nagios, Analytics, MetaOps (deploys, syncs, etc.), Logstash, Statsd, Google Analytics, HIT Metrics. MTTF-- (prevent), MTTD-- (detect), MTTD-- (diagnose), MTBF++ (science to drive changes)
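
As one concrete example of metrics collection, Statsd accepts plain-text datagrams of the form `name:value|type` over UDP. The sketch below builds and sends such datagrams; the metric names, host and port are assumptions chosen for illustration (8125 is the usual Statsd default), not values from the deck.

```python
import socket


def format_statsd(name: str, value: float, kind: str) -> bytes:
    """Build a Statsd datagram; kind is 'c' (counter), 'g' (gauge) or 'ms' (timer)."""
    if kind not in ("c", "g", "ms"):
        raise ValueError(f"unsupported metric type: {kind}")
    return f"{name}:{value}|{kind}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; host and port are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric names.
send_metric(format_statsd("hostedops.deploys", 1, "c"))
send_metric(format_statsd("hostedops.jvm.heap_used_mb", 512, "g"))
```

Because UDP is connectionless, the application emitting metrics is not blocked or broken when the collector is down.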
    14. Events Collection: “A thing that happens in time”. Nagios, Analytics, MetaOps (deploys, syncs, etc.), Logstash, Google Analytics, HIT Metrics. MTTD-- (detect), MTTD-- (diagnose), MTBF++ (science to drive changes)
    15. Log collection: Logstash. “A specialized form of event/metrics”. MTTD-- (detect), MTTD-- (diagnose), MTBF++ (science to drive changes)
    16. Fact collection: Puppet Facts (“think CMDB”), Product versions. MTTD-- (diagnose - with context), MTBF++ (science to drive changes)
    17. Transport. Direct wiring: “Collection1”, “Collection..”, “CollectionM” each connect to “Opentsdb”, “Graphite ..”, “Storage N”: Connection/Conversion explosion. With a transport layer: “Collection1”, “Collection..”, “CollectionM” -> RabbitMQ, 0MQ, P2P System -> “Opentsdb”, “Graphite”, “Storage N”: Flexibility/Scaling/Opens up Information
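
The "connection explosion" on slide 17 is simple arithmetic. A sketch, assuming every collector must feed every storage backend (the counts used are hypothetical):

```python
def direct_connections(collectors: int, stores: int) -> int:
    """Every collector talks to every store: M x N links (and converters)."""
    return collectors * stores


def bus_connections(collectors: int, stores: int) -> int:
    """Everything talks only to the transport layer: M + N links."""
    return collectors + stores


# Hypothetical counts: 8 collection tools, 5 storage backends.
m, n = 8, 5
print("direct: ", direct_connections(m, n))
print("via bus:", bus_connections(m, n))
```

Adding one more storage backend costs M new integrations when wired directly, but only one when a message bus sits in between.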
    18. Storage: Graphite, Opentsdb, Logstash, Event, Elastic Search, Meta Facts Store, Nagios (~cmdb++). MTTD-- (diagnose - understand context), MTBF++ (science to drive changes)
    19. Visualisation: Deploy tool (~Etsy deployinator), Graph Dashboard (~Etsy dashboard), Triage Tool, on top of “Storage”. MTTD-- (detect), MTTD-- (diagnose)
    20. (service) APIs on top of “Storage”: Check/Alert, Graph, Ingest Data, Status, Extract Data, Notification, Search Data. “Polling vs Evented/Streamed API”. MTTD-- (diagnose - understand context), MTBF++ (science to drive changes)
    21. Checks & Alerts: Nagios Check, CEP (Esper), Opentsdb check on steroids, Http check on steroids. “Polling vs Evented/Streamed API”. MTTD-- (detect)
    22. Check - Coverage: Business Perspective (~ behavioral tests); Technical Perspective (diagnostics ~ unit tests). MTTD-- (detect)
    23. Check - Scaling: Nagios 1, Nagios 2, Nagios N; Controller, Queue, Workers 1, Workers 2, Workers 3. Evented ~ Eventmachine. Worker Model ~ Sensu/Flapjack
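
The controller/queue/workers layout on slide 23 can be illustrated with Python's standard library. This is a generic worker-pool sketch, not how Sensu or Flapjack are implemented; the check names and functions are hypothetical.

```python
import queue
import threading


def run_checks(checks, worker_count=3):
    """Controller: enqueue named checks, let workers execute them, collect results."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work for this worker
                break
            name, check = item
            try:
                ok = bool(check())
            except Exception:
                ok = False  # a crashing check counts as a failed check
            with lock:
                results[name] = ok

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for name, check in checks.items():
        tasks.put((name, check))
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results


# Hypothetical checks; real ones would probe HTTP endpoints, disks, JVMs, etc.
print(run_checks({"always_ok": lambda: True, "always_bad": lambda: False}))
```

Scaling out then means adding workers that read from the same queue, rather than adding and re-partitioning monitoring servers.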
    24. Notification Logic: • Collapse events • When to alert or not • Who to alert • Dependencies • Minor/Major/Critical. Prevent False Notification/Alert fatigue/Alert overflow. ~Failing CI cycle should be fixed ASAP - Hygiene
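
"Collapse events" from slide 24 could look like the following sketch: repeated alerts for the same host and check are suppressed for a while after the first notification. The event shape `(timestamp, host, check)` and the 300-second window are assumptions for illustration.

```python
def collapse(events, window_seconds=300):
    """Keep one notification per (host, check) within each suppression window.

    events: iterable of (timestamp_seconds, host, check) tuples.
    """
    last_sent = {}
    notifications = []
    for timestamp, host, check in sorted(events):
        key = (host, check)
        previous = last_sent.get(key)
        if previous is None or timestamp - previous >= window_seconds:
            notifications.append((timestamp, host, check))
            last_sent[key] = timestamp
    return notifications


# Hypothetical alert stream: web1/http fires three times, db1/disk once.
alerts = [
    (0, "web1", "http"),
    (60, "web1", "http"),
    (120, "db1", "disk"),
    (400, "web1", "http"),
]
for notification in collapse(alerts):
    print(notification)
```

The alert at 60 seconds is dropped because it repeats one already sent; the one at 400 seconds goes out again because the window has passed.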
    25. Notification: Zendesk, Hopsbot/Irc, Nagios Alert, Email, MetaOps, Pager Duty, Notification API. MTTD-- (diagnose -> better notification)
    26. MONITORING OVERVIEW 19/01/2012
        COLLECTION (metrics/events): Gmond, Collectd, Nagios, Analytics, Logstash, MetaOps (deploys, syncs, etc..), New Relic, Statsd, Google Analytics, HIT Metrics
        TRANSPORT: RabbitMq, MSB, Logstash, 0mq, P2P Tracker
        STORAGE: Event Storage, Opentsdb, Graphite, Elastic search, Meta Facts
        (Externalized) API: Graph API, Status API, Extract Data, Notification, Ingest Data API, Search Data, Event API
        VISUALIZATION: Deploy Dashboard, Graph Dashboard, Triage Tool
        CHECKS/ALERT (is memory < 90%): Nagios Check, CEP (Esper), Opentsdb check on steroids, Http check on steroids
        NOTIFICATION LOGIC: some logic
        NOTIFICATION: Zendesk, Hopsbot, Nagios Alert, Email, MetaOps, MSB
        Prioritizing rules: Increase Uptime: (detect better) MTTD -> enhances SLA; (resolve faster) MTTR -> enhances SLA. Prevent/Predict Failure: MTBF
    27. Automation: • Drives faster changes • Frees up time for more useful things • Repeatability • Config Mgmt / Puppet
    28. Orchestration: • Workflow of automation • ~rundeck, mcollective, salt
    29. Continuous Deployment: • Same workflow for app & infra changes • Architecture should allow one button deploy • Independent of approval cycle • Better Testing/Confidence. MTTF-- : changes without downtime
    30. Definition of Done (even for our own infrastructure stuff): • Developed • Tested • Deployed • Monitored • Backup • Performance/Scales • Failover • Secure
    31. Continuous Delivery: Automation across DEV, TEST, PROD; Monitoring & Metrics; Feedback/Learning. Feedback can lead to automated fixing of known states
    32. Beyond technical: • Retrospectives / Post Mortems • Knowledge base • Pairing • Cross-Skilling. MTTD-- (diagnose)
    33. Questions?
    34. http://jedi.be/blog https://github.com/monitoringsucks
    35. doing awesome stuff http://www.atlassian.com/32