Atlassian HostedOps Ondemand Monitoring
This presentation covers parts of the previous, current and future state of our monitoring strategy. So don't implement it all :D

Published in: Technology, Design
    1. Monitoring & Metrics. HOSTEDOPS. Patrick Debois - http://jedi.be/blog
    2. Recently started at. Product. Service.
    3. Availability = Uptime / (Uptime + Downtime)
    4. Availability vs Downtime (unplanned)

       Availability          Downtime
       90% (one 9)           36.5 days/year
       99% (two 9s)          3.65 days/year
       99.9% (three 9s)      8.76 hours/year
       99.99% (four 9s)      52 minutes/year
       99.999% (five 9s)     5 minutes/year
       99.9999% (six 9s)     31 seconds/year
    5. Decomposing Availability. Timeline: Correct Behavior (MTTF) -> First Failure -> Diagnose (MTTD) -> Begin Repair -> Repair (MTTR) -> End Repair -> Correct Behavior (MTTF) -> Second Failure. MTBF spans First Failure to Second Failure. MTTF: Mean Time To Failure. MTTD: Mean Time To Diagnose. MTTR: Mean Time To Repair. MTBF: Mean Time Between Failure.
    6. Availability = MTTF / MTBF
    7. Availability = MTTF / (MTTF + MTTD + MTTR). MTTF: Mean Time To Failure. MTTD: Mean Time To Diagnose. MTTR: Mean Time To Repair. MTBF: Mean Time Between Failure.
    8. Mean Time To Diagnose: ~ Mean Time to Detect ~ Mean Time to Notification ~ Mean Time to Respond ~ Time to understand problem
    9. Mean Time To Repair: ~ Time to restart/reset a service ~ Time to diagnose problem ~ Time to find information (~ Mean Time between Blame)
    10. Mean Time To Failure: ~ Detect component fatigue ~ Check trends in capacity & usage ~ Design and component selection ~ Monitoring service abuses ~ Security Checks ~ Auditing/Verification Checks ~ Testing changes
    11. Challenge: SCALE. 20,000 JVMs in 60 days
    12. Sharing/Self Service: • Radiate information • Think ‘opendata’/‘openapi’ • Reduce middleman effort
    13. Metrics Collection: “measurement of a particular characteristic”. Sources: Gmond, Collectd, Nagios, Analytics, MetaOps (deploys, syncs, etc.), Logstash, Statsd, Google Analytics, HIT Metrics. MTTF-- (prevent); MTTD-- (detect); MTTD-- (diagnose); MTBF++ (science to drive changes)
    14. Events Collection: “A thing that happens in time”. Sources: Nagios, Analytics, MetaOps (deploys, syncs, etc.), Logstash, Google Analytics, HIT Metrics. MTTD-- (detect); MTTD-- (diagnose); MTBF++ (science to drive changes)
    15. Log collection: “A specialized form of event/metrics”. Source: Logstash. MTTD-- (detect); MTTD-- (diagnose); MTBF++ (science to drive changes)
    16. Fact collection: “think CMDB”. Sources: Puppet Facts, Product versions. MTTD-- (diagnose - with context); MTBF++ (science to drive changes)
    17. Transport. Without a transport layer, every collector (“Collection1”, “Collection..”, “CollectionM”) connects directly to every store (“Opentsdb”, “Graphite ..”, “Storage N”): Connection/Conversion explosion. With a transport layer (RabbitMQ, 0MQ, P2P System) between the collectors and the stores: Flexibility/Scaling/Opens up Information
    18. Storage (diagram): Graphite, Opentsdb, Logstash, Event, Elastic Search, Meta Facts Store, Nagios (~cmdb++). MTTD-- (diagnose - understand context); MTBF++ (science to drive changes)
    19. Visualisation (diagram, on top of “Storage”): Graph Dashboard (~Etsy dashboard), Deploy tool (~Etsy deployinator), Triage Tool. MTTD-- (detect); MTTD-- (diagnose)
    20. (service) APIs (diagram, on top of “Storage”): Check/Alert API, Graph API, Ingest Data, Status API, Extract Data, Notification API, Search Data. “Polling vs Evented/Streamed API”. MTTD-- (diagnose - understand context); MTBF++ (science to drive changes)
    21. Checks & Alerts: Nagios Check, CEP (Esper), Opentsdb check on steroids, Http check on steroids. “Polling vs Evented/Streamed API”. MTTD-- (detect)
    22. Check - Coverage: Business Perspective (~ behavioral tests); Technical Perspective (diagnostics ~ unit tests). MTTD-- (detect)
    23. Check - Scaling: Nagios 1, Nagios 2, ... Nagios N versus Controller -> Queue -> Workers 1, Workers 2, Workers 3. Evented ~ Eventmachine. Worker Model ~ Sensu/Flapjack
    24. Notification Logic: • Collapse events • When to alert or not • Who to alert • Dependencies • Minor/Major/Critical. Prevent False Notification/Alert fatigue/Alert overflow ~ Failing CI cycle should be fixed ASAP - Hygiene
    25. Notification: Zendesk, Hopsbot/Irc, Nagios Alert, Email, MetaOps, Pager Duty, Notification API. MTTD-- (diagnose -> better notification)
    26. MONITORING OVERVIEW 19/01/2012 (diagram)
       COLLECTION (metrics/events): Gmond, Collectd, Nagios, Analytics, Logstash, MetaOps (deploys, syncs, etc.), New Relic, MSB, Statsd, Google Analytics, HIT Metrics
       TRANSPORT: RabbitMq, Logstash, 0mq, P2P Tracker
       STORAGE: Opentsdb, Graphite, Event Storage, Elastic search, Meta Facts
       (Externalized) API: Graph API, Status API, Extract Data, Notification API, Ingest Data, Search Data, Event API
       VISUALIZATION: Deploy Dashboard, Graph Dashboard, Triage Tool
       CHECKS/ALERT (is memory < 90%): Nagios Check, CEP (Esper), Opentsdb check on steroids, Http check on steroids
       NOTIFICATION LOGIC: some logic
       NOTIFICATION: Zendesk, Hopsbot, Nagios Alert, Email, MetaOps, MSB
       Prioritizing rules: Increase Uptime: (detect better) MTTD -> enhances SLA; (resolve faster) MTTR -> enhances SLA. Prevent/Predict Failure: MTBF
    27. Automation: • Drives faster changes • Frees up time for more useful things • Repeatability • Config Mgmt / Puppet
    28. Orchestration: • Workflow of automation • ~rundeck, mcollective, salt
    29. Continuous Deployment: • Same workflow for app & infra changes • Architecture should allow one button deploy • Independent of approval cycle • Better Testing/Confidence. MTTF-- : changes without downtime
    30. Definition of Done (even for our own infrastructure stuff): • Developed • Tested • Deployed • Monitored • Backup • Performance/Scales • Failover • Secure
    31. Continuous Delivery: Automation across DEV -> TEST -> PROD; Monitoring & Metrics; Feedback/Learning. Feedback can lead to automated fixing of known states
    32. Beyond technical: • Retrospectives / Post Mortems • Knowledge base • Pairing • Cross-Skilling. MTTD-- (diagnose)
    33. Questions?
    34. http://jedi.be/blog https://github.com/monitoringsucks
    35. doing awesome stuff http://www.atlassian.com/32
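
The availability decomposition on slide 7 and the downtime table on slide 4 can be tied together with a little arithmetic. The following is a minimal sketch, not something shown in the talk: the function names `availability` and `downtime_hours_per_year` and the sample figures are assumptions chosen for illustration, and a year is taken as 365 days.

```python
# Sketch of the slide 7 formula: Availability = MTTF / (MTTF + MTTD + MTTR).
# Function names and sample numbers are illustrative assumptions.

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(mttf, mttd, mttr):
    """Return availability as a fraction; all inputs use the same time unit."""
    total = mttf + mttd + mttr
    if total <= 0:
        raise ValueError("MTTF + MTTD + MTTR must be positive")
    return mttf / total


def downtime_hours_per_year(avail):
    """Convert an availability fraction into unplanned downtime per year."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical service: fails about once a month, takes 30 minutes
    # to diagnose and 30 minutes to repair (all values in hours).
    before = availability(mttf=719.0, mttd=0.5, mttr=0.5)
    # Better detection cuts the diagnose time to 6 minutes.
    after = availability(mttf=719.0, mttd=0.1, mttr=0.5)
    for label, a in (("before", before), ("after", after)):
        print(f"{label}: {a:.5%} available, "
              f"{downtime_hours_per_year(a):.2f} hours down per year")
```

Lowering MTTD or MTTR shrinks the denominator, which is why the later slides tag each tool with MTTD-- or MTTR--.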
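
The "Collapse events" and "Minor/Major/Critical" bullets on slide 24 can be sketched as a small filter. This is one possible reading under stated assumptions, not the logic used at Atlassian: the `Alert` type, the `collapse` function, the 300-second window and the rule that an escalation always gets through are all hypothetical.

```python
# Sketch of alert collapsing: repeated alerts for the same host and check
# are suppressed inside a time window unless the severity escalates.
# All names and the window length are illustrative assumptions.
from dataclasses import dataclass

SEVERITY_RANK = {"minor": 0, "major": 1, "critical": 2}


@dataclass(frozen=True)
class Alert:
    host: str
    check: str
    severity: str     # "minor", "major" or "critical"
    timestamp: float  # seconds


def collapse(alerts, window=300.0):
    """Return the alerts that should actually be sent."""
    last_sent = {}  # (host, check) -> last alert that was sent
    sent = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.host, alert.check)
        prev = last_sent.get(key)
        is_repeat = (
            prev is not None
            and alert.timestamp - prev.timestamp < window
            and SEVERITY_RANK[alert.severity] <= SEVERITY_RANK[prev.severity]
        )
        if not is_repeat:
            sent.append(alert)
            last_sent[key] = alert
    return sent


if __name__ == "__main__":
    raw = [
        Alert("app1", "memory", "minor", 0.0),
        Alert("app1", "memory", "minor", 60.0),      # repeat, suppressed
        Alert("app1", "memory", "critical", 120.0),  # escalation, sent
        Alert("app1", "memory", "critical", 500.0),  # window expired, sent
    ]
    for alert in collapse(raw):
        print(alert)
```

Suppressing repeats while letting escalations through is one way to limit the alert fatigue the slide warns about without hiding a worsening problem.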