Incident Resolution as Code
Julien Pivotto (@roidelapluie)
Config Management Camp
February 4th, 2019
user{name="roidelapluie"} 1
I like Open Source
I like monitoring
I like automation
... and all of that is my daily job at ...
Monitoring
Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065
Creative Commons Attribution ShareAlike 2.0 https://www.flickr.com/photos/grendelkhan/400428874
Creative Commons Public Domain https://pxhere.com/en/photo/265717
Creative Commons Attribution 2.0 https://www.flickr.com/photos/51809988@N06/5229933669
Traditional Monitoring
It works - OK
It does not work - CRITICAL
It kinda works - WARNING
I don't know - UNKNOWN
Creative Commons Public Domain https://pxhere.com/fr/photo/952999
Creative Commons Attribution 2.0 https://www.flickr.com/photos/wwarby/2460655511
Creative Commons Attribution-Share Alike 3.0 Unported
https://commons.wikimedia.org/wiki/File:CUPE3903-picketLine-20180504...
Real world
It works ; it does not work ; it kinda works ; it maybe
works ; no one uses it ; it is broken ; some things
are...
The Technical bias
By looking at technical service, we miss the
actual point
Are we serving our users correctly?
Just look...
Observability
Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065
Metrics
Creative Commons Attribution-Share Alike 2.0 https://www.flickr.com/photos/tillwe/11892564676/
Metric
Name
Labels (Key-Value Pairs)
Value (Number)
Timestamp
Fetched at a high frequency
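The four parts listed on this slide can be pictured as a small record. The sketch below is only an illustration: the `Sample` class and the `render` helper are hypothetical names, the timestamp is an arbitrary value, and the output line merely imitates the Prometheus text style (name, labels in braces, value, millisecond timestamp).

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Sample:
    """One metric sample: name, labels, numeric value, timestamp."""
    name: str
    value: float
    timestamp_ms: int
    labels: Dict[str, str] = field(default_factory=dict)


def render(sample: Sample) -> str:
    """Render the sample as a single text line, labels sorted by key."""
    label_text = ",".join(
        f'{key}="{val}"' for key, val in sorted(sample.labels.items())
    )
    # Only add the braces when there is at least one label.
    selector = f"{sample.name}{{{label_text}}}" if label_text else sample.name
    return f"{selector} {sample.value:g} {sample.timestamp_ms}"


# Arbitrary example values; the metric mirrors the one on the intro slide.
line = render(Sample("user", 1, 1549238400000, {"name": "roidelapluie"}))
print(line)
```

Keeping the labels as a plain key-value mapping is what lets the same metric name describe many series (one per label combination).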
Business Metrics
Creative Commons Attribution 2.0 https://www.flickr.com/photos/89228431@N06/11323330056
CPU usage is no money
Creative Commons Attribution-ShareAlike 2.0
https://www.flickr.com/photos/nox_noctis_silentium/39604...
What are business metrics?
how you fulfill your customers' requests
quality and level of business service
Where are we?
Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/hernanpc/6259950189
What do we have?
metrics that tell us if business works
DB, Frontends, balancers, queuing systems...
They don't come from t...
Recap on High Availability
Traditional High Availability
LinuxHA, pacemaker, corosync
"health check" script
Restart, promote, balance traffic elsewhe...
High Availability Nowadays
Multiple workers
Health exposed by the app
Load balancer balances to healthy nodes
Unhealthy nod...
When HA is not enough...
the processes are not "really" crashing
the component that has issues does not really
know about ...
Async cases
Backlog can be delayed by a few hours
But not a complete week-end ...
Alerting
An alert is fired = someone has to take action
Runbooks to follow, depending on the alert
knowledge is built, then runbo...
An ideal world
Creative Commons Attribution 2.0 https://www.flickr.com/photos/athomeinscottsdale/3247600886/
Ideally...
Memory leaks are fixed (quickly)
Multi master, redundant, in service discovery
You build it, you run it
Full co...
Ironically
Developers are often paid only for features
Ops not working closely with devs
No "bugfix money" for stuff that does ...
What's next?
Creative Commons Attribution 2.0 https://www.flickr.com/photos/janitors/15795816662/
if Frontends says backend responds slowly
then Restart the backend
if Lots of write errors towards NFS
then Balance traffic to another datacenter
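The two if/then slides read like a lookup from an alert to a remediation. A minimal sketch of that idea follows, assuming hypothetical alert names and playbook file names that do not come from the talk.

```python
from typing import Dict, Optional

# Hypothetical alert names and playbook files, chosen only for illustration.
REMEDIATIONS: Dict[str, str] = {
    "BackendRespondsSlowly": "restart_backend.yml",
    "NfsWriteErrors": "switch_datacenter.yml",
}


def playbook_for(alert_name: str) -> Optional[str]:
    """Return the playbook for a known alert, or None when a human must act."""
    return REMEDIATIONS.get(alert_name)


print(playbook_for("BackendRespondsSlowly"))
print(playbook_for("SomethingUnexpected"))
```

Returning `None` for unknown alerts keeps the default safe: anything without an agreed remediation still goes to a person.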
Challenges
How to prevent concurrent Ansible runs?
How to avoid large scale failure? e.g. errors in
playbook?
How to make ...
In practice
Prometheus
Time series collection database
Computing rules, creating alerts
View on all the components
Open Source
Alertmanager
Receives alerts from Prometheus
Propagates the alerts via webhooks
Open Source (part of Prometheus)
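Alertmanager webhook notifications arrive as a JSON document containing an `alerts` list, where each alert carries a `status`, `labels`, `annotations` and `startsAt`. The sketch below shows how a receiver might keep only the firing alerts; the payload is an invented sample and `firing_alerts` is a hypothetical helper, not code from the talk.

```python
import json
from typing import Any, Dict, List

# Invented sample payload, shaped like an Alertmanager webhook notification.
PAYLOAD = """
{
  "version": "4",
  "status": "firing",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "BackendSlow", "env": "prod"},
     "annotations": {},
     "startsAt": "2019-02-04T10:00:00Z"},
    {"status": "resolved",
     "labels": {"alertname": "NfsWriteErrors", "env": "prod"},
     "annotations": {},
     "startsAt": "2019-02-04T09:00:00Z"}
  ]
}
"""


def firing_alerts(body: str) -> List[Dict[str, Any]]:
    """Parse a webhook body and return only the alerts that are firing."""
    document = json.loads(body)
    return [a for a in document.get("alerts", []) if a.get("status") == "firing"]


names = [alert["labels"]["alertname"] for alert in firing_alerts(PAYLOAD)]
print(names)
```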
Ansible
Open Source orchestration/cfgmgmt
Acts on multiple servers
Knows your infra
Webhook that calls Ansible
How/where to get the credentials? Where to
run Ansible?
Duration of the Ansible run?
Which serve...
Queues
Alerts as JSON
Labels: env, playbook, limit
alert2amqp
AMQP protocol
Apache Artemis message broker (ActiveMQ
family)
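The slide names three labels (`env`, `playbook`, `limit`) that travel with each alert as JSON. A rough sketch of turning an alert's labels into a queue message body is shown below. The exact message format used by alert2amqp is not shown in the slides, so the layout here is an assumption, and publishing to the broker is deliberately left out.

```python
import json
from typing import Dict, Optional

# The three labels named on the slide.
REQUIRED_LABELS = ("env", "playbook", "limit")


def to_queue_message(labels: Dict[str, str]) -> Optional[bytes]:
    """Serialise the routing labels as JSON bytes, or None if any is missing."""
    if any(name not in labels for name in REQUIRED_LABELS):
        return None
    body = {name: labels[name] for name in REQUIRED_LABELS}
    return json.dumps(body, sort_keys=True).encode("utf-8")


# Label values are made up for the example.
message = to_queue_message(
    {"alertname": "BackendSlow", "env": "prod",
     "playbook": "restart_backend", "limit": "backend01"}
)
print(message)
```

Dropping alerts that lack any of the three labels means only alerts explicitly prepared for automation ever reach the queue.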
Consumer
amqp2jenkins
Reads from queue
Launches a Jenkins job that will run Ansible
Stops processing if a Jenkins job is fail...
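The consumer behaviour described on this slide (read in order, launch a job, stop when a job fails) can be sketched with an in-memory queue. `launch_job` is a hypothetical callable standing in for the Jenkins job trigger, and leaving the failing message at the head of the queue is an assumption, not something the slides state.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Message = Dict[str, str]


def consume(queue: Deque[Message],
            launch_job: Callable[[Message], bool]) -> List[Message]:
    """Process messages in order and stop at the first failed job.

    Returns the messages handled successfully; the failing message and
    everything behind it stay in the queue.
    """
    done: List[Message] = []
    while queue:
        message = queue[0]
        if not launch_job(message):
            break  # stop processing until someone looks at the failure
        queue.popleft()
        done.append(message)
    return done


pending = deque([{"playbook": "a"}, {"playbook": "b"}, {"playbook": "c"}])
handled = consume(pending, lambda m: m["playbook"] != "b")  # job "b" fails
print(handled, list(pending))
```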
Recap
In an ideal world we do not need this. Bugs are
fixed, technology is up to date, infra and apps are
redundant.
Achievement
Metrics from 1, 2, X sources generate alerts that
trigger automated resolution within minutes
towards d...
Safeties
Needs monitoring to be up
Needs the last Ansible run to be green
Whitelist upfront
Discards "old alerts"
No concu...
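Three of the safeties on this slide (the whitelist, discarding old alerts, and no concurrent execution) can be sketched in a few lines. The age threshold, the whitelist contents and the function names are assumptions made for illustration, and the process-local lock is only a stand-in for whatever the real pipeline uses to serialise runs.

```python
import threading
from datetime import datetime, timedelta, timezone
from typing import Callable, Set

MAX_ALERT_AGE = timedelta(minutes=15)               # assumed threshold
ALLOWED_PLAYBOOKS: Set[str] = {"restart_backend"}   # assumed whitelist
_run_lock = threading.Lock()                        # one remediation at a time


def should_run(playbook: str, starts_at: datetime, now: datetime) -> bool:
    """Apply the whitelist and the 'old alert' check."""
    if playbook not in ALLOWED_PLAYBOOKS:
        return False
    return now - starts_at <= MAX_ALERT_AGE


def run_exclusively(action: Callable[[], None]) -> bool:
    """Run action unless another one is in progress; report whether it ran."""
    if not _run_lock.acquire(blocking=False):
        return False
    try:
        action()
        return True
    finally:
        _run_lock.release()


now = datetime(2019, 2, 4, 12, 0, tzinfo=timezone.utc)
print(should_run("restart_backend", now - timedelta(minutes=5), now))
print(should_run("restart_backend", now - timedelta(hours=3), now))
```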
Use Cases
Must happen infrequently
Must not be predictable
Must not do more harm
Must impact daily work or on call
Code
https://github.com/roidelapluie/alert2amqp
https://github.com/roidelapluie/amqp2jenkins
Julien Pivotto
roidelapluie
roidelapluie@inuits.eu
Inuits
https://inuits.eu
info@inuits.eu
Contact

This talk explains how we use Prometheus and Ansible to resolve incidents in an unattended way, with AMQP and Jenkins in between.

Incident Resolution as Code

  1. Incident Resolution as Code Julien Pivotto (@roidelapluie) Config Management Camp February 4th, 2019
  2. user{name="roidelapluie"} 1 I like Open Source I like monitoring I like automation ... and all of that is my daily job at inuits
  3. Monitoring Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065
  4. Creative Commons Attribution ShareAlike 2.0 https://www.flickr.com/photos/grendelkhan/400428874
  5. Creative Commons Public Domain https://pxhere.com/en/photo/265717
  6. Creative Commons Attribution 2.0 https://www.flickr.com/photos/51809988@N06/5229933669
  7. Traditional Monitoring It works - OK It does not work - CRITICAL It kinda works - WARNING I don't know - UNKNOWN
  8. Creative Commons Public Domain https://pxhere.com/fr/photo/952999
  9. Creative Commons Attribution 2.0 https://www.flickr.com/photos/wwarby/2460655511
  10. Creative Commons Attribution-Share Alike 3.0 Unported https://commons.wikimedia.org/wiki/File:CUPE3903-picketLine-20180504.jpg
  11. Real world It works ; it does not work ; it kinda works ; it maybe works ; no one uses it ; it is broken ; some things are broken ; it should work but it does not ; where are my users? help me...
  12. The Technical bias By looking at technical service, we miss the actual point Are we serving our users correctly? Just looking at the traffic light will not tell you about the traffic jams.
  13. Observability Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065
  14. Metrics Creative Commons Attribution-Share Alike 2.0 https://www.flickr.com/photos/tillwe/11892564676/
  15. Metric Name Labels (Key-Value Pairs) Value (Number) Timestamp Fetched at a high frequency
  16. Business Metrics Creative Commons Attribution 2.0 https://www.flickr.com/photos/89228431@N06/11323330056
  17. CPU usage is no money Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/nox_noctis_silentium/3960497840
  18. What are business metrics? how you fulfill your customers' requests quality and level of business service
  19. Where are we? Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/hernanpc/6259950189
  20. What do we have? metrics that tell us if business works DB, Frontends, balancers, queuing systems... They don't come from the troublesome component!
  21. Recap on High Availability
  22. Traditional High Availability LinuxHA, pacemaker, corosync "health check" script Restart, promote, balance traffic elsewhere
  23. High Availability Nowadays Multiple workers Health exposed by the app Load balancer balances to healthy nodes Unhealthy nodes are restarted automatically
  24. When HA is not enough... the processes are not "really" crashing the component that has issues does not really know about them (metrics available from DB, frontends, clients...) there is no HA in place... (but still need 24x7 availability)
  25. Async cases Backlog can be delayed by a few hours But not a complete week-end ...
  26. Alerting An alert is fired = someone has to take action Runbooks to follow, depending on the alert knowledge is built, then runbook == (ansible) playbook
  27. An ideal world Creative Commons Attribution 2.0 https://www.flickr.com/photos/athomeinscottsdale/3247600886/
  28. Ideally... Memory leaks are fixed (quickly) Multi master, redundant, in service discovery You build it, you run it Full control over 3rd parties (and their bugs...)
  29. Ironically Developers are often paid only for features Ops not working closely with devs No "bugfix money" for stuff that does not happen very often Code base is 20 years old and "it will be decommissioned soon"
  30. What's next? Creative Commons Attribution 2.0 https://www.flickr.com/photos/janitors/15795816662/
  31. if Frontends says backend responds slowly then Restart the backend
  32. if Lots of write errors towards NFS then Balance traffic to another datacenter
  33. Challenges How to prevent concurrent Ansible runs? How to avoid large scale failure? e.g. errors in playbook? How to make sure it works?
  34. In practice
  35. Prometheus Time series collection database Computing rules, creating alerts View on all the components Open Source
  36. Alertmanager Receives alerts from Prometheus Propagates the alerts via webhooks Open Source (part of Prometheus)
  37. Ansible Open Source orchestration/cfgmgmt Acts on multiple servers Knows your infra
  38. Webhook that calls Ansible How/where to get the credentials? Where to run Ansible? Duration of the Ansible run? Which server to act upon? Concurrent playbooks?
  39. Queues Alerts as JSON Labels: env, playbook, limit alert2amqp AMQP protocol Apache Artemis message broker (ActiveMQ family)
  40. Consumer amqp2jenkins Reads from queue Launches a Jenkins job that will run Ansible Stops processing if a Jenkins job is failing Ansible jobs take 2 inputs
  41. Recap In an ideal world we do not need this. Bugs are fixed, technology is up to date, infra and apps are redundant.
  42. Achievement Metrics from 1, 2, X sources generate alerts that trigger automated resolution within minutes towards different systems. Common incidents get solved more quickly than with human intervention. People are woken up less often for known issues with a clear runbook.
  43. Safeties Needs monitoring to be up Needs the last Ansible run to be green Whitelist upfront Discards "old alerts" No concurrent execution Alerts someone if not resolved in time
  44. Use Cases Must happen infrequently Must not be predictable Must not do more harm Must impact daily work or on call
  45. Code https://github.com/roidelapluie/alert2amqp https://github.com/roidelapluie/amqp2jenkins
  46. Julien Pivotto roidelapluie roidelapluie@inuits.eu Inuits https://inuits.eu info@inuits.eu Contact
