Incident Resolution as Code

This talk explains how we use Prometheus and Ansible to resolve incidents in an unattended way, with AMQP and Jenkins connecting the two.

Published in: Technology

  1. 1. Incident Resolution as Code. Julien Pivotto (@roidelapluie), Config Management Camp, February 4th, 2019
  2. 2. user{name="roidelapluie"} 1: I like Open Source; I like monitoring; I like automation; ... and all of that is my daily job at Inuits
  3. 3. Monitoring Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065
  4. 4. Creative Commons Attribution ShareAlike 2.0 https://www.flickr.com/photos/grendelkhan/400428874
  5. 5. Creative Commons Public Domain https://pxhere.com/en/photo/265717
  6. 6. Creative Commons Attribution 2.0 https://www.flickr.com/photos/51809988@N06/5229933669
  7. 7. Traditional Monitoring: It works - OK; It does not work - CRITICAL; It kinda works - WARNING; I don't know - UNKNOWN
  8. 8. Creative Commons Public Domain https://pxhere.com/fr/photo/952999
  9. 9. Creative Commons Attribution 2.0 https://www.flickr.com/photos/wwarby/2460655511
  10. 10. Creative Commons Attribution-Share Alike 3.0 Unported https://commons.wikimedia.org/wiki/File:CUPE3903-picketLine-20180504.jpg
  11. 11. Real world: it works; it does not work; it kinda works; it maybe works; no one uses it; it is broken; some things are broken; it should work but it does not; where are my users? help me...
  12. 12. The Technical bias: By looking only at technical services, we miss the actual point: are we serving our users correctly? Just looking at the traffic light will not tell you about the traffic jams.
  13. 13. Observability Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065
  14. 14. Metrics Creative Commons Attribution-Share Alike 2.0 https://www.flickr.com/photos/tillwe/11892564676/
  15. 15. Metric: Name; Labels (Key-Value Pairs); Value (Number); Timestamp. Fetched at a high frequency.
  16. 16. Business Metrics Creative Commons Attribution 2.0 https://www.flickr.com/photos/89228431@N06/11323330056
  17. 17. CPU usage is no money Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/nox_noctis_silentium/3960497840
  18. 18. What are business metrics? How you fulfill your customers' requests; quality and level of business service
  19. 19. Where are we? Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/hernanpc/6259950189
  20. 20. What do we have? Metrics that tell us if the business works; DB, frontends, balancers, queuing systems...; They don't come from the troublesome component!
  21. 21. Recap on High Availability
  22. 22. Traditional High Availability: LinuxHA, pacemaker, corosync; "health check" script; Restart, promote, balance traffic elsewhere
  23. 23. High Availability Nowadays: Multiple workers; Health exposed by the app; Load balancer balances to healthy nodes; Unhealthy nodes are restarted automatically
  24. 24. When HA is not enough... the processes are not "really" crashing; the component that has issues does not really know about them (metrics available from DB, frontends, clients...); there is no HA in place... (but still need 24x7 availability)
  25. 25. Async cases: Backlog can be delayed by a few hours; But not a complete weekend...
  26. 26. Alerting: an alert is fired = someone has to take action; Runbooks to follow, depending on the alert; knowledge is built, then runbook == (Ansible) playbook
  27. 27. An ideal world Creative Commons Attribution 2.0 https://www.flickr.com/photos/athomeinscottsdale/3247600886/
  28. 28. Ideally... Memory leaks are fixed (quickly); Multi master, redundant, in service discovery; You build it, you run it; Full control over 3rd parties (and their bugs...)
  29. 29. Ironically: Developers are often just paid for features; Ops not working closely with devs; No "bugfix money" for issues that do not happen very often; Code base is 20 years old and "it will be decommissioned soon"
  30. 30. What's next? Creative Commons Attribution 2.0 https://www.flickr.com/photos/janitors/15795816662/
  31. 31. if: Frontends say the backend responds slowly; then: Restart the backend
  32. 32. if: Lots of write errors towards NFS; then: Balance traffic to another datacenter
  33. 33. Challenges: How to prevent concurrent Ansible runs? How to avoid large-scale failures, e.g. errors in a playbook? How to make sure it works?
  34. 34. In practice
  35. 35. Prometheus: Time series collection database; Computing rules, creating alerts; View on all the components; Open Source
  36. 36. Alertmanager: Receives alerts from Prometheus; Propagates the alerts via webhooks; Open Source (part of Prometheus)
  37. 37. Ansible: Open Source orchestration/configuration management; Acts on multiple servers; Knows your infra
  38. 38. Webhook that calls Ansible: How/where to get the credentials? Where to run Ansible? Duration of the Ansible run? Which server to act upon? Concurrent playbooks?
  39. 39. Queues: Alerts as JSON; Labels: env, playbook, limit; alert2amqp; AMQP protocol; Apache Artemis message broker (ActiveMQ family)
  40. 40. Consumer: amqp2jenkins; Reads from the queue; Launches a Jenkins job that will run Ansible; Stops processing if a Jenkins job is failing; Ansible jobs take 2 inputs
  41. 41. Recap: In an ideal world we do not need this. Bugs are fixed, technology is up to date, infra and apps are redundant.
  42. 42. Achievement: Metrics from 1, 2, X sources generate alerts that trigger automated resolution within minutes towards different systems. Common incidents get solved more quickly than with human intervention. People are woken up less often for known issues with a clear runbook.
  43. 43. Safeties: Needs monitoring to be up; Needs the last Ansible run to be green; Whitelist upfront; Discards "old alerts"; No concurrent execution; Alerts someone if not resolved in time
  44. 44. Use Cases: Must happen infrequently; Must not be predictable; Must not do more harm; Must impact daily work or on call
  45. 45. Code https://github.com/roidelapluie/alert2amqp https://github.com/roidelapluie/amqp2jenkins
  46. 46. Contact: Julien Pivotto (roidelapluie), roidelapluie@inuits.eu; Inuits, https://inuits.eu, info@inuits.eu
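
As an illustration of slides 39 and 43, here is a minimal Python sketch of how an Alertmanager webhook payload might be filtered before anything is queued. It is not taken from alert2amqp. The `alerts`, `status`, `labels` and `startsAt` fields are part of the Alertmanager webhook format, and the `env`, `playbook` and `limit` labels come from the slides. The whitelist contents, the 15-minute cut-off and the function name are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical whitelist, agreed upfront: (env, playbook) pairs allowed to run unattended.
ALLOWED = {("prod", "restart_backend.yml")}
# Assumed cut-off for what counts as an "old alert".
MAX_AGE = timedelta(minutes=15)


def jobs_from_webhook(payload, now):
    """Turn an Alertmanager webhook payload into a list of job requests.

    Alerts that are not firing, lack the routing labels, are not
    whitelisted, or started too long ago are skipped.
    """
    jobs = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") != "firing":
            continue
        if not all(key in labels for key in ("env", "playbook", "limit")):
            continue
        if (labels["env"], labels["playbook"]) not in ALLOWED:
            continue
        # Assumes an RFC 3339 timestamp without sub-microsecond digits.
        started = datetime.fromisoformat(alert["startsAt"].replace("Z", "+00:00"))
        if now - started > MAX_AGE:
            continue
        jobs.append(
            {
                "env": labels["env"],
                "playbook": labels["playbook"],
                "limit": labels["limit"],
            }
        )
    return jobs
```

Each returned dictionary could then be serialized as JSON and published to the queue. Passing `now` explicitly keeps the staleness check easy to test.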

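The consumer behaviour from slide 40 (one job at a time, stop as soon as a job fails) can be sketched in the same way. This is an assumption-laden illustration, not the amqp2jenkins code: `run_job` is a hypothetical callable that stands in for "trigger the Jenkins job and wait for its result", and `on_failure` stands in for whatever notifies a human.

```python
from collections import deque


def drain(queue, run_job, on_failure):
    """Process queued jobs strictly one after another.

    run_job(job) is a hypothetical callable returning True on success.
    The first failure stops processing; the failed job and everything
    behind it stay in the queue until someone has looked at it.
    """
    done = []
    while queue:
        job = queue[0]
        if not run_job(job):
            on_failure(job)
            break
        done.append(queue.popleft())
    return done
```

Running jobs serially from a single consumer is one simple way to get "no concurrent execution", and halting on the first failure keeps a broken playbook from being replayed against every queued alert.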