Atlassian HostedOps Ondemand Monitoring
This presentation covers parts of the previous, current, and future state of our monitoring strategy, so don't implement it all :D

Upload Details

Uploaded as Apple Keynote

Usage Rights

© All Rights Reserved

Atlassian HostedOps Ondemand Monitoring: Presentation Transcript

  • Monitoring & Metrics, HOSTEDOPS. Patrick Debois - http://jedi.be/blog
  • Recently started at Product Service
  • Availability = Uptime / (Uptime + Downtime)
  • Availability vs Downtime (unplanned)
      Availability          Downtime
      90% (one 9)           36.5 days/year
      99% (two 9s)          3.65 days/year
      99.9% (three 9s)      8.76 hours/year
      99.99% (four 9s)      52 minutes/year
      99.999% (five 9s)     5 minutes/year
      99.9999% (six 9s)     31 seconds/year
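The downtime column follows directly from the availability percentage. As a minimal sketch, assuming a 365-day year (the function name `downtime_per_year` is made up for this illustration), the table can be recomputed like this:

```python
def downtime_per_year(availability_percent: float) -> float:
    """Return the allowed unplanned downtime in seconds per 365-day year."""
    seconds_per_year = 365 * 24 * 60 * 60
    return (1 - availability_percent / 100) * seconds_per_year


# Print the downtime budget for each number of nines from the slide.
for pct in (90, 99, 99.9, 99.99, 99.999, 99.9999):
    secs = downtime_per_year(pct)
    print(f"{pct}% -> {secs / 3600:.3f} hours/year ({secs:.0f} seconds)")
```

Dividing the seconds by 86400 gives days, which is the unit the slide uses for the first two rows.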
  • Decomposing Availability
      Timeline: Correct Behavior -> First Failure -> Diagnose -> Begin Repair -> Repair -> End Repair -> Correct Behavior -> Second Failure
      MTTF covers Correct Behavior, MTTD covers Diagnose, MTTR covers Repair; MTBF runs from First Failure to Second Failure
      MTTF: Mean Time To Failure
      MTTD: Mean Time To Diagnose
      MTTR: Mean Time To Repair
      MTBF: Mean Time Between Failure
  • Availability = MTTF / MTBF
  • Availability = MTTF / (MTTF + MTTD + MTTR)
      MTTF: Mean Time To Failure
      MTTD: Mean Time To Diagnose
      MTTR: Mean Time To Repair
      MTBF: Mean Time Between Failure
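To make the decomposition concrete, here is a small sketch of the formula from the slide. The numbers in the usage lines (720 hours between failures, half an hour to diagnose, 1.5 hours to repair) are invented for illustration and do not come from the talk.

```python
def availability(mttf: float, mttd: float, mttr: float) -> float:
    """Availability = MTTF / (MTTF + MTTD + MTTR), all in the same time unit."""
    mtbf = mttf + mttd + mttr  # MTBF as decomposed on the slide
    return mttf / mtbf


# Hypothetical values in hours: shortening diagnosis raises availability.
baseline = availability(mttf=720, mttd=0.5, mttr=1.5)
faster_diagnosis = availability(mttf=720, mttd=0.25, mttr=1.5)
print(f"baseline:         {baseline:.5%}")
print(f"faster diagnosis: {faster_diagnosis:.5%}")
```

This is why the later slides tag each tool with the term it is meant to improve: any reduction in MTTD or MTTR shrinks the denominator.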
  • Mean Time To Diagnose ~ Mean Time to Detect ~ Mean Time to Notification ~ Mean Time to Respond ~ Time to understand problem
  • Mean Time To Repair ~ Time to restart/reset a service ~ Time to diagnose problem ~ Time to find information (~ Mean Time between Blame)
  • Mean Time To Failure ~ Detect component fatigue ~ Check trends in capacity & usage ~ Design and component selection ~ Monitoring service abuses ~ Security Checks ~ Auditing/Verification Checks ~ Testing changes
  • Challenge: SCALE. 20,000 JVMs in 60 Days
  • Sharing/Self Service
      • Radiate information
      • Think ‘opendata’/‘openapi’
      • Reduce middleman effort
  • Metrics Collection: “measurement of a particular characteristic”
      Gmond, Collectd, Nagios Analytics, MetaOps (deploys, syncs, etc.), Logstash, Statsd, Google Analytics, HIT Metrics
      MTTF-- (prevent); MTTD-- (detect); MTTD-- (diagnose); MTBF++ (science to drive changes)
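As one illustration of what a collected metric looks like on the wire, the sketch below formats a value in the StatsD line format (`name:value|type`, where `c` is a counter, `g` a gauge, and `ms` a timer) and sends it over UDP. The metric name `deploys.count`, the helper names, and the host and port defaults are assumptions for this example; the talk does not show how Statsd was configured.

```python
import socket


def statsd_line(name: str, value: float, metric_type: str) -> str:
    """Format one metric in the StatsD line format: name:value|type."""
    if metric_type not in ("c", "g", "ms"):
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; host and port are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, formatted but not sent.
print(statsd_line("deploys.count", 1, "c"))
print(statsd_line("jvm.heap.used_mb", 512, "g"))
```

Calling `send_metric(statsd_line("deploys.count", 1, "c"))` would emit the datagram; because UDP is connectionless, the sender is not blocked when no collector is listening.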
  • Events Collection: “A thing that happens in time”
      Nagios Analytics, MetaOps (deploys, syncs, etc.), Logstash, Google Analytics, HIT Metrics
      MTTD-- (detect); MTTD-- (diagnose); MTBF++ (science to drive changes)
  • Log collection: “A specialized form of event/metrics”
      Logstash
      MTTD-- (detect); MTTD-- (diagnose); MTBF++ (science to drive changes)
  • Fact collection: “think CMDB”
      Puppet Facts, Product versions
      MTTD-- (diagnose - with context); MTBF++ (science to drive changes)
  • Transport
      Direct: “Collection1”, “Collection..”, “CollectionM” each connect to “Opentsdb”, “Graphite ..”, “Storage N”: Connection/Conversion explosion
      With a transport in between: “Collection1”, “Collection..”, “CollectionM” -> RabbitMQ, 0MQ, P2P System -> “Opentsdb”, “Graphite”, “Storage N”: Flexibility/Scaling/Opens up Information
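The "connection/conversion explosion" is simple arithmetic: M collectors talking directly to N stores need up to M x N integrations, while a shared transport needs one link per participant. A minimal sketch follows; the function names and the counts of 8 collectors and 5 stores are arbitrary choices for illustration.

```python
def direct_links(collectors: int, stores: int) -> int:
    """Every collector integrates with every store."""
    return collectors * stores


def brokered_links(collectors: int, stores: int) -> int:
    """Every collector and every store connects once to the transport."""
    return collectors + stores


# Arbitrary example sizes, not figures from the talk.
m, n = 8, 5
print(f"direct:   {direct_links(m, n)} integrations")
print(f"brokered: {brokered_links(m, n)} integrations")
```

Adding a new store in the direct layout touches every collector; with a transport it is one more subscriber.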
  • Storage
      Graphite Opentsdb Logstash Event Elastic Search Meta Facts Store Nagios (~cmdb++)
      MTTD-- (diagnose - understand context); MTBF++ (science to drive changes)
  • Visualisation
      Graph Dashboard (~Etsy dashboard), Triage Tool, Deploy tool (~Etsy deployinator), on top of “Storage”
      MTTD-- (detect); MTTD-- (diagnose)
  • (service) APIs
      Check/Alert API, Graph API, Status API, Notification API, Ingest Data, Extract Data, Search Data, on top of “Storage”
      “Polling vs Evented/Streamed API”
      MTTD-- (diagnose - understand context); MTBF++ (science to drive changes)
  • Checks & Alerts
      Nagios Check, CEP (Esper), Opentsdb check on steroids, Http check on steroids
      “Polling vs Evented/Streamed API”
      MTTD-- (detect)
  • Check - Coverage
      Business Perspective (~ behavioral tests)
      Technical Perspective (diagnostics ~ unit tests)
      MTTD-- (detect)
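An evented check reacts to each incoming sample instead of polling on a schedule. The sketch below is a much simplified Python stand-in for the kind of windowed rule a CEP engine evaluates; it is not Esper's API, and the class name, the threshold of 90, and the window of 3 samples are all assumptions made for illustration.

```python
from collections import deque


class ConsecutiveBreachRule:
    """Fire only when the last `window` samples all exceed `threshold`."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # keeps only the newest samples

    def on_event(self, value: float) -> bool:
        """Feed one sample; return True if the rule fires on it."""
        self.recent.append(value)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(v > self.threshold for v in self.recent)


# Hypothetical memory-usage percentages arriving as a stream.
rule = ConsecutiveBreachRule(threshold=90, window=3)
for sample in (95, 96, 80, 91, 92, 93):
    print(sample, rule.on_event(sample))
```

Requiring several consecutive breaches is one simple way to keep a single spike from turning into an alert.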
  • Check - Scaling
      Nagios 1, Nagios 2, ..., Nagios N
      Controller -> Queue -> Workers 1, Workers 2, Workers 3
      Evented ~ Eventmachine
      Worker Model ~ Sensu/Flapjack
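The controller/queue/workers layout can be sketched with the Python standard library. This only illustrates the worker model in general, not how Sensu or Flapjack are implemented; the function `run_checks` and the check names in the usage lines are hypothetical.

```python
import queue
import threading
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]], worker_count: int = 3) -> Dict[str, bool]:
    """Controller: enqueue checks, let workers execute them, collect results."""
    tasks: queue.Queue = queue.Queue()
    results: Dict[str, bool] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work for this worker
                break
            name, check = item
            try:
                ok = bool(check())
            except Exception:
                ok = False  # a crashing check counts as a failed check
            with lock:
                results[name] = ok

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for item in checks.items():
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results


# Hypothetical checks standing in for real probes.
print(run_checks({"disk": lambda: True, "memory": lambda: False}))
```

Scaling out then means adding workers behind the queue rather than adding another full Nagios instance.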
  • Notification Logic
      • Collapse events
      • When to alert or not
      • Who to alert
      • Dependencies
      • Minor/Major/Critical
      Prevent False Notification/Alert fatigue/Alert overflow
      ~ Failing CI cycle should be fixed ASAP - Hygiene
  • Notification
      Zendesk, Hopsbot/Irc, Nagios Alert, Email, MetaOps, Pager Duty, Notification API
      MTTD-- (diagnose -> better notification)
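Two of the points above, collapsing events and deciding who to alert by severity, can be sketched as follows. The `Alert` type, the rule of keeping the highest severity per host and check, and the severity-to-channel mapping are assumptions for illustration; the talk does not spell out the actual rules.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

SEVERITY_RANK = {"minor": 0, "major": 1, "critical": 2}
# Hypothetical routing table: which channel gets which severity.
CHANNEL_BY_SEVERITY = {"minor": "email", "major": "irc", "critical": "pager"}


@dataclass(frozen=True)
class Alert:
    host: str
    check: str
    severity: str  # "minor", "major" or "critical"


def collapse(alerts: Iterable[Alert]) -> List[Alert]:
    """Keep one alert per (host, check), retaining the highest severity seen."""
    collapsed: Dict[Tuple[str, str], Alert] = {}
    for alert in alerts:
        key = (alert.host, alert.check)
        current = collapsed.get(key)
        if current is None or SEVERITY_RANK[alert.severity] > SEVERITY_RANK[current.severity]:
            collapsed[key] = alert
    return list(collapsed.values())


def route(alert: Alert) -> str:
    """Pick a notification channel from the alert's severity."""
    return CHANNEL_BY_SEVERITY[alert.severity]


# Hypothetical burst of alerts: three raw events become two notifications.
burst = [
    Alert("web1", "http", "minor"),
    Alert("web1", "http", "critical"),
    Alert("db1", "disk", "major"),
]
for alert in collapse(burst):
    print(alert.host, alert.check, alert.severity, "->", route(alert))
```

Dependency handling (suppressing alerts for services behind an already failing component) is left out of this sketch.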
  • MONITORING OVERVIEW (19/01/2012)
      COLLECTION (metrics/events): Gmond, Collectd, Nagios Analytics, Logstash, MetaOps (deploys, syncs, etc.), Statsd, Google Analytics, HIT Metrics
      TRANSPORT: RabbitMq, 0mq, P2P Tracker
      STORAGE: Opentsdb, Graphite, Event Storage, Elastic search, Meta Facts
      VISUALIZATION: Deploy Dashboard, Graph Dashboard, Triage Tool
      (Externalized) API: Graph API, Status API, Notification API, Event API, Ingest Data, Extract Data, Search Data
      CHECKS/ALERT (is memory < 90%): Nagios Check, CEP (Esper), Opentsdb check on steroids, Http check on steroids
      NOTIFICATION LOGIC: some logic
      NOTIFICATION: Zendesk, Hopsbot, Nagios Alert, Email, MetaOps
      Also on the diagram: New Relic, MSB
      Prioritizing rules:
      Increase Uptime: (detect better) MTTD -> enhances SLA; (resolve faster) MTTR -> enhances SLA
      Prevent/Predict Failure: MTBF
  • Automation
      • Drives faster changes
      • Frees up time for more useful things
      • Repeatability
      • Config Mgmt / Puppet
  • Orchestration
      • Workflow of automation
      • ~ rundeck, mcollective, salt
  • Continuous Deployment
      • Same workflow for app & infra changes
      • Architecture should allow one button deploy
      • Independent of approval cycle
      • Better Testing/Confidence
      MTTF-- : changes without downtime
  • Definition of Done (even for our own infrastructure stuff)
      • Developed
      • Tested
      • Deployed
      • Monitored
      • Backup
      • Performance/Scales
      • Failover
      • Secure
  • Continuous Delivery
      Automation; DEV, TEST, PROD; Monitoring & Metrics; Feedback/Learning
      Feedback can lead to automated fixing of known states
  • Beyond technical
      • Retrospectives / Post Mortems
      • Knowledge base
      • Pairing
      • Cross-Skilling
      MTTD-- (diagnose)
  • Questions?
  • http://jedi.be/blog https://github.com/monitoringsucks
  • doing awesome stuff http://www.atlassian.com/32