Practical Monitoring Techniques

Best practices and tools we use at AppsFlyer to ease on-call duty for DevOps and developers

  1. Practical Monitoring Techniques
  2. Today's Talk ● Our Mission ● Current Tools ● Increasing Coverage ● PD Schedules ● Automatic Self Healing ● Bots and Alerts Channels ● Events Dashboard ● Dashboard Accessibility ● Best Practices Summary
  3. Our Mission: Back up culture with the proper tools to support it
  4. Current Tools ● Metrics collection: Collectd, statsd, CloudWatch ● Monitoring: Sensu, NewRelic ● Alert channels: PagerDuty, email, Slack ● Dashboards: Grafana, CloudWatch, NewRelic ● Application testing: E2E Testing System ● Internal tools: Sensu mobile, events system, Sensu bar and more
  5. Increasing Coverage ● Automatic collection of basic system and 3rd party metrics for new instances ● Automatically add alerts for new instances of an existing subscriber ● Each Developer / DevOps is responsible for monitoring their own application / infrastructure ● Easy method to add new alerts and dashboards ● Automatic events flow
  6. Pager Schedules ● Divided into logical groups of ownership ● Each schedule has an escalation point ● The on-call engineer should be able to connect and respond to issues in their area ● Easy method to override a schedule ● Ability to contact the relevant on-call engineer ● Ability to page the relevant on-call engineer
  7. Automatic Self Healing ● Better MTTR ● Avoid waking the on-call engineer if possible ● Log healing activity to surface recurring issues ● Limit the healing to avoid restart loops ● Make sure to sync Healer ↔ Alert
  8. Bots, Integrations and Alerts Channels ● Alert channels: email, Slack, PD mobile, SMS, calls ● Integrations: Sensu to PD/Slack, CloudWatch to PD, 3rd party (e.g. CouchBase, NewRelic, etc.) to PD ● Slack Bot:
  9. Events Dashboard ● Simple REST API for sending events ● Clean timeline view to spot production events ● Connections between events (“depends on” and “dependents”) ● Detailed view for each event
  10. Accessibility ● Available from everywhere by mobile ● Easy to ack, resolve, and mute alerts ● Slack bots to reach help ● Automatically get a graph with the alert ● Ability to search, edit, copy, etc. alerts ● Treat alert management as code (SVC, DB, backups, etc.)
  11. Best Practices Summary ● Share the pain ● Automate base metrics ● Automate healing ● Make help reachable ● Make it easy to add alerts and dashboards ● Use warning levels as soft events to avoid phone calls at night ● Automate graphs in alerts ● Run a positive check of the alerting system each day ● Dependencies between alerts ● Postmortems
  12. Questions
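
The Increasing Coverage slide mentions adding alerts automatically when a new instance joins an existing subscriber group. The slides do not show how this is done, so the following is only a minimal sketch of the idea. The check names, the `SUBSCRIPTION_CHECKS` table and the `checks_for_instance` helper are all hypothetical and are not Sensu's configuration schema.

```python
# Hypothetical check names; a real setup would map these to actual check definitions.
BASE_CHECKS = ["cpu", "memory", "disk"]

# Hypothetical mapping from a subscription (role) to the extra checks it needs.
SUBSCRIPTION_CHECKS = {
    "web": ["http_5xx_rate"],
    "kafka": ["kafka_lag"],
}


def checks_for_instance(hostname, subscriptions):
    """Return the checks a new instance should get: base checks plus per-subscription ones."""
    names = list(BASE_CHECKS)
    for sub in subscriptions:
        for check in SUBSCRIPTION_CHECKS.get(sub, []):
            if check not in names:  # avoid duplicates when subscriptions overlap
                names.append(check)
    return [{"host": hostname, "check": name} for name in names]
```

With a table like this, a new host only needs to declare its subscriptions; nobody has to remember to add its alerts by hand.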
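
The Automatic Self Healing slide asks for healing that is logged and limited so it cannot turn into a restart loop. One way to sketch that is a sliding-window limit on restart attempts. The `Healer` class, its parameters and the "heal"/"escalate" labels are assumptions made for illustration, not the tool described in the talk.

```python
import time
from collections import deque


class Healer:
    """Restart a failing service, but at most `max_attempts` times per time window."""

    def __init__(self, restart, max_attempts=3, window_seconds=600, clock=time.monotonic):
        self.restart = restart              # callable that performs the actual restart
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock                  # injectable so the limit can be tested
        self.attempts = {}                  # service name -> deque of attempt timestamps
        self.log = []                       # (timestamp, service, action) for later review

    def handle_failure(self, service):
        now = self.clock()
        recent = self.attempts.setdefault(service, deque())
        # Forget attempts that fell out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_attempts:
            # Too many recent restarts: stop healing and hand over to a human.
            self.log.append((now, service, "escalate"))
            return "escalate"
        recent.append(now)
        self.restart(service)
        self.log.append((now, service, "heal"))
        return "heal"
```

The log doubles as the record of recurring issues: a service that shows up there every night is a candidate for a real fix rather than another restart.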
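
The Events Dashboard slide describes events linked by "depends on" and "dependents". The slides give no schema or endpoint for the events API, so this sketch only models the two relations in memory. The `Event` fields and the `EventStore` methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    event_id: str
    title: str
    depends_on: list = field(default_factory=list)  # ids of events this one depends on


class EventStore:
    def __init__(self):
        self.events = {}

    def add(self, event):
        self.events[event.event_id] = event

    def dependents(self, event_id):
        """Ids of events that list `event_id` in their depends_on."""
        return sorted(e.event_id for e in self.events.values() if event_id in e.depends_on)

    def root_causes(self, event_id):
        """Follow depends_on links up to events that depend on nothing known."""
        roots, stack, seen = set(), [event_id], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            parents = [p for p in self.events[current].depends_on if p in self.events]
            if not parents:
                roots.add(current)
            stack.extend(parents)
        return sorted(roots)
```

Storing only `depends_on` and deriving `dependents` on demand keeps the two views from drifting apart. An event with no dependencies is reported as its own root cause.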
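
Two items in the Best Practices Summary, warning levels as soft events and dependencies between alerts, come down to a routing decision. The sketch below assumes two severity labels and uses plain channel names as return values. It does not call PagerDuty or Slack, and the `route_alert` function and its arguments are hypothetical.

```python
def route_alert(name, severity, firing, depends_on):
    """Decide which channels an alert goes to.

    name:       the alert that just fired
    severity:   "warning" or "critical" (assumed labels)
    firing:     set of alert names that are currently firing
    depends_on: dict mapping an alert name to the alerts it depends on
    """
    parents = depends_on.get(name, ())
    if any(parent in firing for parent in parents):
        # A dependency is already firing, so someone has been paged for the cause.
        return ["slack"]
    if severity == "critical":
        return ["pagerduty", "slack"]
    # Warnings are soft events: visible in chat, but no phone call at night.
    return ["slack"]
```

For example, if `db-down` is already firing, a critical `api-errors` alert that depends on it goes to Slack only, so the on-call engineer gets one page for the incident instead of several.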
