Incident Resolution as Code

This talk explains how we use Prometheus and Ansible to resolve incidents in an unattended way, with AMQP and Jenkins in between.

  1. 1. Incident Resolution as Code. Julien Pivotto (@roidelapluie), Config Management Camp, February 4th, 2019
  2. 2. user{name="roidelapluie"} 1: I like Open Source; I like monitoring; I like automation; ... and all of that is my daily job at Inuits
  3. 3. Monitoring Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065
  4. 4. Creative Commons Attribution ShareAlike 2.0 https://www.flickr.com/photos/grendelkhan/400428874
  5. 5. Creative Commons Public Domain https://pxhere.com/en/photo/265717
  6. 6. Creative Commons Attribution 2.0 https://www.flickr.com/photos/51809988@N06/5229933669
  7. 7. Traditional Monitoring: It works - OK; It does not work - CRITICAL; It kinda works - WARNING; I don't know - UNKNOWN
  8. 8. Creative Commons Public Domain https://pxhere.com/fr/photo/952999
  9. 9. Creative Commons Attribution 2.0 https://www.flickr.com/photos/wwarby/2460655511
  10. 10. Creative Commons Attribution-Share Alike 3.0 Unported https://commons.wikimedia.org/wiki/File:CUPE3903-picketLine-20180504.jpg
  11. 11. Real world: It works; it does not work; it kinda works; it maybe works; no one uses it; it is broken; some things are broken; it should work but it does not; where are my users? help me...
  12. 12. The technical bias: By looking only at technical services, we miss the actual point: are we serving our users correctly? Just looking at the traffic light will not tell you about the traffic jams.
  13. 13. Observability Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065
  14. 14. Metrics Creative Commons Attribution-Share Alike 2.0 https://www.flickr.com/photos/tillwe/11892564676/
  15. 15. Metric: Name; Labels (key-value pairs); Value (number); Timestamp; Fetched at a high frequency
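The four parts of a metric listed on this slide can be sketched as a small data structure. This is only an illustration, not code from the talk: the `Sample` class and its `render` method are hypothetical names, the timestamp is an arbitrary value in milliseconds, and the rendering follows the Prometheus text format without handling escaping of label values.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One metric sample: name, labels, numeric value, timestamp."""
    name: str
    value: float
    timestamp_ms: int
    labels: dict = field(default_factory=dict)

    def render(self):
        # Labels are sorted so the rendered line is deterministic.
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(self.labels.items()))
        label_part = "{" + pairs + "}" if pairs else ""
        return f"{self.name}{label_part} {self.value:g} {self.timestamp_ms}"


# Same metric as on the speaker's introduction slide, with an arbitrary timestamp.
sample = Sample("user", 1, 1549238400000, {"name": "roidelapluie"})
print(sample.render())
```

A monitoring system scrapes many such samples at a high frequency and stores them as time series.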
  16. 16. Business Metrics Creative Commons Attribution 2.0 https://www.flickr.com/photos/89228431@N06/11323330056
  17. 17. CPU usage is no money Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/nox_noctis_silentium/3960497840
  18. 18. What are business metrics? How you fulfill your customers' requests; quality and level of business service
  19. 19. Where are we? Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/hernanpc/6259950189
  20. 20. What do we have? Metrics that tell us if the business works; DB, frontends, balancers, queuing systems...; They don't come from the troublesome component!
  21. 21. Recap on High Availability
  22. 22. Traditional High Availability: LinuxHA, Pacemaker, Corosync; "health check" script; Restart, promote, balance traffic elsewhere
  23. 23. High Availability Nowadays: Multiple workers; Health exposed by the app; Load balancer balances to healthy nodes; Unhealthy nodes are restarted automatically
  24. 24. When HA is not enough: the processes are not "really" crashing; the component that has issues does not really know about them (metrics available from DB, frontends, clients...); there is no HA in place... (but still need 24x7 availability)
  25. 25. Async cases: Backlog can be delayed by a few hours, but not by a complete weekend...
  26. 26. Alerting: alert is fired = someone has to take action; Runbooks to follow, depending on the alert; knowledge is built, then runbook == (Ansible) playbook
  27. 27. An ideal world Creative Commons Attribution 2.0 https://www.flickr.com/photos/athomeinscottsdale/3247600886/
  28. 28. Ideally: Memory leaks are fixed (quickly); Multi master, redundant, in service discovery; You build it, you run it; Full control over 3rd parties (and their bugs...)
  29. 29. Ironically: Developers are often paid just for features; Ops not working closely with devs; No "bugfix money" for stuff that does not happen really often; Code base is 20 years old and "it will be decommissioned soon"
  30. 30. What's next? Creative Commons Attribution 2.0 https://www.flickr.com/photos/janitors/15795816662/
  31. 31. if frontends say the backend responds slowly, then restart the backend
  32. 32. if there are lots of write errors towards NFS, then balance traffic to another datacenter
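The two "if ... then ..." slides can be read as condition/action rules. The sketch below is one possible way to express that idea; it is not how the talk implements it (there the conditions are Prometheus alerts). The metric names, thresholds, and action names are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Rule:
    name: str
    condition: Callable[[Dict[str, float]], bool]
    action: str


# Hypothetical metric names and thresholds, chosen only for the example.
RULES = [
    Rule(
        "backend_slow",
        lambda m: m.get("frontend_backend_latency_seconds", 0.0) > 2.0,
        "restart_backend",
    ),
    Rule(
        "nfs_write_errors",
        lambda m: m.get("nfs_write_errors_per_second", 0.0) > 10.0,
        "balance_to_other_datacenter",
    ),
]


def actions_for(metrics):
    """Return the actions whose condition holds for the given metric values."""
    return [rule.action for rule in RULES if rule.condition(metrics)]


print(actions_for({"frontend_backend_latency_seconds": 3.5}))
```

Note that both conditions are observed from outside the failing component (frontends, NFS clients), which matches the point made earlier about where the useful metrics come from.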
  33. 33. Challenges: How to prevent concurrent Ansible runs? How to avoid large-scale failure, e.g. errors in a playbook? How to make sure it works?
  34. 34. In practice
  35. 35. Prometheus: Time series collection database; Computing rules, creating alerts; View on all the components; Open Source
  36. 36. Alertmanager: Receives alerts from Prometheus; Propagates the alerts via webhooks; Open Source (part of Prometheus)
  37. 37. Ansible: Open Source orchestration/cfgmgmt; Acts on multiple servers; Knows your infra
  38. 38. Webhook that calls Ansible: How/where to get the credentials? Where to run Ansible? Duration of the Ansible run? Which server to act upon? Concurrent playbooks?
  39. 39. Queues: Alerts as JSON; Labels: env, playbook, limit; alert2amqp; AMQP protocol; Apache Artemis message broker (ActiveMQ family)
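To make "alerts as JSON" with the `env`, `playbook` and `limit` labels more concrete, here is a minimal sketch that turns an Alertmanager webhook body into queue messages. It does not reproduce the actual alert2amqp code: the function name, the message layout, and the example alert are assumptions, and publishing to the AMQP broker is left out entirely.

```python
import json

# Label names taken from the slide; everything else here is an assumption.
REQUIRED_LABELS = ("env", "playbook", "limit")


def alerts_to_messages(webhook_body):
    """Turn an Alertmanager webhook body into JSON messages for a queue."""
    payload = json.loads(webhook_body)
    messages = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        # Only firing alerts that carry all routing labels are forwarded.
        if alert.get("status") != "firing":
            continue
        if any(name not in labels for name in REQUIRED_LABELS):
            continue
        message = {name: labels[name] for name in REQUIRED_LABELS}
        message["startsAt"] = alert.get("startsAt")
        messages.append(json.dumps(message, sort_keys=True))
    return messages


# Hypothetical webhook body with one routable alert and one without labels.
example_body = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [
        {"status": "firing", "startsAt": "2019-02-04T10:00:00Z",
         "labels": {"alertname": "BackendSlow", "env": "prod",
                    "playbook": "restart_backend.yml", "limit": "backend01"}},
        {"status": "firing", "startsAt": "2019-02-04T10:01:00Z",
         "labels": {"alertname": "DiskFull", "env": "prod"}},
    ],
})
for line in alerts_to_messages(example_body):
    print(line)
```

Carrying the playbook and the host limit as alert labels means the decision about what to run stays in the alerting rules rather than in the webhook receiver.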
  40. 40. Consumer: amqp2jenkins; Reads from queue; Launches a Jenkins job that will run Ansible; Stops processing if a Jenkins job is failing; Ansible jobs take 2 inputs
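The consumer behaviour described on this slide (read from the queue, launch a job, stop as soon as a job fails) might look roughly like the loop below. This is a sketch under stated assumptions, not the amqp2jenkins implementation: the queue is an in-memory `deque`, `launch_job` is a stand-in for triggering Jenkins, and leaving the failed message at the head of the queue is a choice made for the example.

```python
from collections import deque


def consume(queue, launch_job):
    """Process messages in order and stop at the first failed job.

    Returns the number of jobs that succeeded. The failed message and
    everything behind it stay in the queue.
    """
    succeeded = 0
    while queue:
        message = queue[0]
        if not launch_job(message):
            break
        queue.popleft()
        succeeded += 1
    return succeeded


def fake_launch(message):
    # Stand-in for a Jenkins job: fails for one hypothetical playbook.
    return message["playbook"] != "broken.yml"


pending = deque([
    {"playbook": "restart_backend.yml", "limit": "backend01"},
    {"playbook": "broken.yml", "limit": "backend02"},
    {"playbook": "restart_backend.yml", "limit": "backend03"},
])
done = consume(pending, fake_launch)
print(done, len(pending))
```

Stopping on the first failure keeps a faulty playbook from being applied again and again across the infrastructure.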
  41. 41. Recap: In an ideal world we do not need this. Bugs are fixed, technology is up to date, infra and apps are redundant.
  42. 42. Achievement: Metrics from 1, 2, X sources generate alerts that trigger automated resolution within minutes towards different systems. Common incidents get solved more quickly than with human intervention. People are woken up less often for known issues with a clear runbook.
  43. 43. Safeties: Needs monitoring to be up; Needs the last Ansible run to be green; Whitelist upfront; Discards "old alerts"; No concurrent execution; Alerts someone if not resolved in time
  44. 44. Use Cases: Must happen infrequently; Must not be predictable; Must not do more harm; Must impact daily work or on call
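Several of the safeties on this slide can be combined into one check that runs before any playbook is launched. The function below is a hypothetical sketch: its name, its parameters, and the 15-minute age limit in the example are assumptions, not values from the talk.

```python
from datetime import datetime, timedelta, timezone


def should_run(playbook, started_at, now, whitelist, max_age,
               last_run_green, job_running):
    """Return True only if every safety check passes."""
    if playbook not in whitelist:
        return False  # only playbooks whitelisted upfront may run
    if now - started_at > max_age:
        return False  # discard "old alerts"
    if not last_run_green:
        return False  # the last Ansible run must be green
    if job_running:
        return False  # no concurrent execution
    return True


now = datetime(2019, 2, 4, 10, 5, tzinfo=timezone.utc)
fresh = datetime(2019, 2, 4, 10, 0, tzinfo=timezone.utc)
stale = datetime(2019, 2, 4, 8, 0, tzinfo=timezone.utc)
allowed = {"restart_backend.yml"}
limit = timedelta(minutes=15)

print(should_run("restart_backend.yml", fresh, now, allowed, limit, True, False))
print(should_run("restart_backend.yml", stale, now, allowed, limit, True, False))
```

The remaining safety, alerting someone if the incident is not resolved in time, would sit outside this check, for example as a separate alert that pages a person.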
  45. 45. Code https://github.com/roidelapluie/alert2amqp https://github.com/roidelapluie/amqp2jenkins
  46. 46. Contact: Julien Pivotto (roidelapluie), roidelapluie@inuits.eu; Inuits, https://inuits.eu, info@inuits.eu