Incident Resolution as Code

Julien Pivotto (@roidelapluie)
Config Management Camp
February 4th, 2019

user{name="roidelapluie"} 1
I like Open Source
I like monitoring
I like automation
... and all of that is my daily job at Inuits

Monitoring
Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065

Creative Commons Attribution ShareAlike 2.0 https://www.flickr.com/photos/grendelkhan/400428874

Creative Commons Public Domain https://pxhere.com/en/photo/265717

Traditional Monitoring
It works - OK
It does not work - CRITICAL
It kinda works - WARNING
I don't know - UNKNOWN

Creative Commons Public Domain https://pxhere.com/fr/photo/952999

Creative Commons Attribution 2.0 https://www.flickr.com/photos/wwarby/2460655511

Creative Commons Attribution-Share Alike 3.0 Unported
https://commons.wikimedia.org/wiki/File:CUPE3903-picketLine-20180504.jpg

Real world
It works; it does not work; it kinda works; maybe it
works; no one uses it; it is broken; some things
are broken; it should work but it does not; where
are my users? help me...

The Technical bias
By looking only at technical services, we miss the
actual point
Are we serving our users correctly?
Just looking at the traffic light will not tell you
about the traffic jams.

Observability

Metrics
Creative Commons Attribution-Share Alike 2.0 https://www.flickr.com/photos/tillwe/11892564676/

Metric
Name
Labels (Key-Value Pairs)
Value (Number)
Timestamp
Fetched at a high frequency
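
As a minimal sketch of the four parts listed above, a metric sample can be modelled as a small record. The class name `Sample`, the `render` method, and the timestamp value are illustrative assumptions, and the rendered line only imitates the style of the `user{name="roidelapluie"} 1` example from the intro slide.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Sample:
    """One metric sample: name, labels, numeric value, timestamp."""
    name: str
    value: float
    timestamp_ms: int  # milliseconds since the Unix epoch
    labels: Dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        """Render as: name{key="value",...} value timestamp."""
        label_str = ",".join(
            f'{key}="{val}"' for key, val in sorted(self.labels.items())
        )
        selector = f"{self.name}{{{label_str}}}" if label_str else self.name
        return f"{selector} {self.value:g} {self.timestamp_ms}"


# Hypothetical sample with an arbitrary example timestamp.
sample = Sample("user", 1, 1549238400000, {"name": "roidelapluie"})
print(sample.render())
```

Because samples are fetched at a high frequency, each one stays this small: a name, a few labels, one number, and one timestamp.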

Business Metrics

CPU usage is not money
Creative Commons Attribution-ShareAlike 2.0
https://www.flickr.com/photos/nox_noctis_silentium/3960497840

What are business metrics?
how you fulfill your customers' requests
quality and level of business service

Where are we?
Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/hernanpc/6259950189

What do we have?
metrics that tell us if business works
DB, frontends, balancers, queuing systems...
They don't come from the troublesome
component!

Traditional High Availability
LinuxHA, pacemaker, corosync
"health check" script
Restart, promote, balance traffic elsewhere

High Availability Nowadays
Multiple workers
Health exposed by the app
Load balancer balances to healthy nodes
Unhealthy nodes are restarted automatically

When HA is not enough...
the processes are not "really" crashing
the component that has issues does not really
know about them (metrics available from DB,
frontends, clients...)
there is no HA in place... (but 24x7 availability
is still needed)

Async cases
Backlog can be delayed by a few hours
But not by a complete weekend...

Alerting
Alert is fired = someone has to take action
Runbooks to follow, depending on the alert
Knowledge is built, then runbook == (Ansible)
playbook

An ideal world
Creative Commons Attribution 2.0 https://www.flickr.com/photos/athomeinscottsdale/3247600886/

Ideally...
Memory leaks are fixed (quickly)
Multi master, redundant, in service discovery
You build it, you run it
Full control over 3rd parties (and their bugs..)

Ironically
Developers are often paid only for features
Ops not working closely with devs
No "bugfix money" for things that do not happen
very often
Code base is 20 years old and "it will be
decommissioned soon"

What's next?
Creative Commons Attribution 2.0 https://www.flickr.com/photos/janitors/15795816662/

if Frontends say the backend responds slowly
then Restart the backend

if Lots of write errors towards NFS
then Balance traffic to another datacenter
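
The two if/then rules above can be sketched as a lookup from alert name to remediation playbook. This is only an illustration: the alert names and playbook file names below are hypothetical and do not come from the talk.

```python
from typing import Dict, Optional

# Hypothetical alert names mapped to hypothetical playbook files.
REMEDIATIONS: Dict[str, str] = {
    "BackendRespondsSlowly": "restart_backend.yml",
    "NFSWriteErrors": "failover_datacenter.yml",
}


def playbook_for(alert_name: str) -> Optional[str]:
    """Return the playbook for a known alert, or None if a human must act."""
    return REMEDIATIONS.get(alert_name)


print(playbook_for("BackendRespondsSlowly"))
print(playbook_for("SomethingNew"))
```

An alert with no entry falls through to the usual on-call path, so only incidents with a known runbook are automated.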

Challenges
How to prevent concurrent Ansible runs?
How to avoid large-scale failure, e.g. errors in
a playbook?
How to make sure it works?

Prometheus
Time series collection database
Computing rules, creating alerts
View on all the components
Open Source

Alertmanager
Receives alerts from Prometheus
Propagates the alerts via webhooks
Open Source (part of Prometheus)

Ansible
Open Source orchestration/cfgmgmt
Acts on multiple servers
Knows your infra

Webhook that calls Ansible
How/where to get the credentials? Where to
run Ansible?
Duration of the Ansible run?
Which server to act upon?
Concurrent playbooks?

Queues
Alerts as JSON
Labels: env, playbook, limit
alert2amqp
AMQP protocol
Apache Artemis message broker (activemq
family)
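
As a rough sketch of the "alerts as JSON" step, the function below turns an Alertmanager webhook body into one queue message per firing alert, keeping the `env`, `playbook` and `limit` labels named above. It does not show how alert2amqp is actually implemented: the function name, the message layout, and the decision to skip alerts missing a label are assumptions, and publishing to the AMQP broker is left out.

```python
import json
from typing import List

REQUIRED_LABELS = ("env", "playbook", "limit")


def alerts_to_messages(payload: str) -> List[str]:
    """Build one JSON message per firing alert that carries all labels."""
    body = json.loads(payload)
    messages = []
    for alert in body.get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") != "firing":
            continue  # resolved alerts need no remediation
        if not all(key in labels for key in REQUIRED_LABELS):
            continue  # assumption: incomplete alerts are ignored
        message = {key: labels[key] for key in REQUIRED_LABELS}
        message["startsAt"] = alert.get("startsAt")
        messages.append(json.dumps(message, sort_keys=True))
    return messages


# Hypothetical webhook body with one usable and one incomplete alert.
example = json.dumps({
    "status": "firing",
    "alerts": [
        {"status": "firing", "startsAt": "2019-02-04T10:00:00Z",
         "labels": {"env": "prod", "playbook": "restart_backend.yml",
                    "limit": "backend01"}},
        {"status": "firing", "startsAt": "2019-02-04T10:01:00Z",
         "labels": {"env": "prod"}},
    ],
})
for msg in alerts_to_messages(example):
    print(msg)
```

Putting the messages on a queue instead of running Ansible from the webhook decouples alert reception from long-running playbooks.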

Consumer
amqp2jenkins
Reads from queue
Launches a Jenkins job that runs Ansible
Stops processing if a Jenkins job is failing
Ansible jobs take 2 inputs
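
The consumer behaviour above (read from the queue, launch a job, stop when a job fails) can be sketched with an in-memory queue. This is not the amqp2jenkins code: `consume` and `run_job` are hypothetical names, and the real Jenkins and AMQP calls are replaced by a plain callable.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def consume(queue: Deque[Dict[str, str]],
            run_job: Callable[[Dict[str, str]], bool]) -> List[Dict[str, str]]:
    """Process messages in order; stop at the first failing job.

    The failed message and everything behind it stay in the queue,
    so nothing else runs until someone has looked at the failure.
    """
    done = []
    while queue:
        message = queue[0]
        if not run_job(message):
            break
        done.append(queue.popleft())
    return done


# Hypothetical job runner: fails for one specific playbook.
def fake_job(message: Dict[str, str]) -> bool:
    return message["playbook"] != "broken.yml"


pending = deque([
    {"playbook": "restart_backend.yml", "limit": "backend01"},
    {"playbook": "broken.yml", "limit": "backend02"},
    {"playbook": "restart_backend.yml", "limit": "backend03"},
])
finished = consume(pending, fake_job)
print(len(finished), len(pending))
```

Processing one message at a time also gives a simple answer to the concurrent-playbook question: the next job only starts after the previous one has succeeded.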

Recap
In an ideal world we do not need this. Bugs are
fixed, technology is up to date, infra and apps are
redundant.

Achievement
Metrics from 1, 2, ... X sources generate alerts that
trigger automated resolution within minutes,
across different systems.
Common incidents get solved more quickly than
with human intervention.
People are woken up less often for known issues
with clear runbooks.

Safeties
Needs monitoring to be up
Needs the last ansible run to be green
Whitelist upfront
Discards "old alerts"
No concurrent execution
Alerts someone if not resolved in time
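
Several of the safeties above can be expressed as one gate that is checked before any playbook runs. The sketch below covers the whitelist, the green last run, the discarding of old alerts, and the ban on concurrent execution. The function name, the reason strings, and the 15-minute age limit are assumptions; checking that monitoring is up and escalating unresolved alerts are not shown.

```python
from datetime import datetime, timedelta, timezone
from typing import Set, Tuple

MAX_ALERT_AGE = timedelta(minutes=15)  # assumed threshold, not from the talk


def may_run(playbook: str, alert_started: datetime, now: datetime,
            whitelist: Set[str], last_run_green: bool,
            running: Set[str]) -> Tuple[bool, str]:
    """Return (allowed, reason) for an automated remediation request."""
    if playbook not in whitelist:
        return False, "playbook not whitelisted"
    if not last_run_green:
        return False, "last Ansible run was not green"
    if now - alert_started > MAX_ALERT_AGE:
        return False, "alert is too old"
    if playbook in running:
        return False, "playbook already running"
    return True, "ok"


now = datetime(2019, 2, 4, 12, 0, tzinfo=timezone.utc)
allowed = {"restart_backend.yml"}
print(may_run("restart_backend.yml", now - timedelta(minutes=2),
              now, allowed, True, set()))
```

Every refusal returns a reason, which can be forwarded to whoever is alerted when the incident is not resolved in time.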

Use Cases
Must happen infrequently
Must not be predictable
Must not do more harm
Must impact daily work or on call

Code
https://github.com/roidelapluie/alert2amqp
https://github.com/roidelapluie/amqp2jenkins

Contact
Julien Pivotto
roidelapluie
roidelapluie@inuits.eu
Inuits
https://inuits.eu
info@inuits.eu
