Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany.
Organized by EIT Digital and Huawei GRC, Germany.
Twitter: @CloudRR2016
Failures happen. Building resilient cloud infrastructure requires an end-to-end automated approach to failure remediation. This approach must go beyond the current DevOps model of monitoring the system and getting engineers alerted when a failure condition occurs.
Recently, event-driven automation and workflows re-emerged as a way to automate troubleshooting, remediation, and a variety of Day-2 operations. Facebook famously uses FBAR to "save 16,000 engineer-hours, a day, in ops". Similar approaches have been reported by other hyper-scale cloud providers. Open-source auto-remediation platforms like StackStorm are replacing legacy runbook-automation products, and have been successfully used to automate applications, networks, security, and cloud infrastructure.
In this presentation we give a brief history of workflow automation, overview the common architectural ingredients of a typical event-driven automation framework, compare and contrast alternative approaches to Day-2 automation, and, most importantly, share real-world use cases and examples of applying event-driven automation in operations.
[Slide: "On-call, Without Automation". Timeline from 2:00 AM to 2:30 AM: the PagerDuty alert fires at 2:00 AM; the engineer wakes up, logs in and ACKs, checks the runbook, studies the alert, runs diagnostics, and finally fixes the problem at 2:30 AM.]
And now with Winston.
Netflix started using Winston for Cassandra auto-remediation, and it grew into remediation-as-a-service.
It was presented at QCon.
Winston gets the alert. Using its rule engine, it decides what the right action is. The action then analyzes the issue, and if it is identified as a false positive, there is no need to page the on-call.
Another use case: Winston identifies that it can fix the issue itself. When it does, again, there is no need to page the on-call.
The last use case, the one we want you to focus on, is Assisted Diagnostics. While the on-call is being paged, Winston runs a series of pre-defined diagnostics and prepares a report for the on-call, so that when they log in to the system they have comprehensive information: the discovery status, a list of recent exceptions or errors, or any other relevant context to help them make a decision faster.
Now let’s talk about workflows.
Remember that workflow is a part of event-driven automation, but a very important part.
Sequence: tasks run one after another.
Typical remediation sequence: update config, clean the logs, restart the server.
Note the workflow definition: the name of the task, an action with its input, and a transition. Simple, concise, readable YAML.
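A minimal sketch of that remediation sequence in the Mistral v2 DSL; the workflow name, the shell commands, and the use of the stock StackStorm `core.remote` action are illustrative assumptions, not taken from the slides:

```yaml
version: '2.0'

remediate_server:
  type: direct
  input:
    - hostname
  tasks:
    update_config:
      # push the corrected config to the affected host
      action: core.remote cmd="cp /etc/myapp/good.conf /etc/myapp/app.conf" hosts=<% $.hostname %>
      on-success:
        - clean_logs
    clean_logs:
      # free disk space before restarting
      action: core.remote cmd="rm -f /var/log/myapp/*.log" hosts=<% $.hostname %>
      on-success:
        - restart_server
    restart_server:
      action: core.remote cmd="service myapp restart" hosts=<% $.hostname %>
```

Each task names an action with its input and declares the transition to the next task, nothing more.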
Data passing: a workflow’s ability to carry data downstream, and to refer to that data efficiently, is key.
In this example, troubleshooting results obtained by task 1 are published to chatops by task 2.
We can refer to task results directly, or “publish” a named variable for convenience.
This funny syntax here is YAQL – Yet Another Query Language – we preferred it over Jinja for extensibility and type support.
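A hedged sketch of the data-passing pattern; the troubleshooting command and the channel name are made up, while `chatops.post_message` is the stock StackStorm ChatOps action:

```yaml
version: '2.0'

diagnose_and_report:
  type: direct
  input:
    - hostname
  tasks:
    troubleshoot:
      action: core.remote cmd="df -h; tail -n 50 /var/log/syslog" hosts=<% $.hostname %>
      publish:
        # name the result once, refer to it downstream as $.report
        report: <% task(troubleshoot).result %>
      on-success:
        - post_results
    post_results:
      # <% ... %> delimits a YAQL expression
      action: chatops.post_message channel="ops" message=<% $.report %>
```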
Simple conditions:
Simply: deploy the app; on success, post to chat; on failure, page the on-call admin.
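A sketch of that branching using Mistral’s `on-success` and `on-error` transitions; the `my_pack.deploy_app` and `my_pack.page_oncall` actions are hypothetical placeholders:

```yaml
version: '2.0'

deploy:
  type: direct
  tasks:
    deploy_app:
      action: my_pack.deploy_app        # hypothetical deploy action
      on-success:
        - post_to_chat
      on-error:
        - page_oncall
    post_to_chat:
      action: chatops.post_message channel="ops" message="Deploy succeeded"
    page_oncall:
      # hypothetical action wrapping the paging service API
      action: my_pack.page_oncall message="Deploy failed, paging on-call"
```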
Conditions can be based on data:
This workflow runs a switch diagnostic action, which may be just a shell script, and acts based on the return code. This is the most common pattern.
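A sketch of branching on data with YAQL conditions on the transitions; the `switch_diag.sh` script, the meaning of its return codes, and the `my_pack.*` actions are all assumptions for illustration:

```yaml
version: '2.0'

switch_remediation:
  type: direct
  input:
    - switch_name
  tasks:
    run_diagnostic:
      # the diagnostic may be just a shell script; we branch on its exit code
      action: core.local cmd="./switch_diag.sh <% $.switch_name %>"
      publish:
        code: <% task(run_diagnostic).result.return_code %>
      on-complete:
        - reboot_port: <% $.code = 1 %>
        - page_oncall: <% $.code = 2 %>
    reboot_port:
      action: my_pack.reboot_port switch=<% $.switch_name %>
    page_oncall:
      action: my_pack.page_oncall message="Switch diagnostic needs a human"
```

Using `on-complete` means the transitions are evaluated whether the diagnostic action succeeded or failed.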
And that’s it!
In my view, that set of patterns is sufficient. To make it “efficient”, we may want a few more patterns.
Parallel task execution.
This example is from our own CI: we use StackStorm to build StackStorm.
Once it is built and packaged, I deploy and test it on three operating systems. In parallel, obviously.
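A sketch of the fan-out, with a hypothetical `my_ci` pack standing in for the real build and test actions; in Mistral, listing several tasks under one `on-success` is what starts them in parallel:

```yaml
version: '2.0'

build_and_test:
  type: direct
  tasks:
    build_package:
      action: my_ci.build_package       # hypothetical build action
      on-success:
        # all three test tasks start as soon as the build succeeds
        - test_ubuntu
        - test_centos
        - test_debian
    test_ubuntu:
      action: my_ci.run_tests os="ubuntu"
    test_centos:
      action: my_ci.run_tests os="centos"
    test_debian:
      action: my_ci.run_tests os="debian"
```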
Now that the execution is split in parallel, how do we join it back?
How do we put Humpty Dumpty back together again? It’s not easy.
According to workflow patterns, there are 16 ways to join.
How many times t5 runs, and when, depends on the type of join.
Simple merge: t5 runs three times, once for each upstream execution.
That’s what I want here: report the completion on each of parallel tasks.
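In Mistral, the simple merge is simply the absence of a `join` keyword on the downstream task. A fragment, continuing the hypothetical CI example above:

```yaml
    # simple merge: no "join" on t5, so it runs once for each
    # upstream task that transitions to it, three times in total
    test_ubuntu:
      action: my_ci.run_tests os="ubuntu"
      on-success:
        - t5
    test_centos:
      action: my_ci.run_tests os="centos"
      on-success:
        - t5
    test_debian:
      action: my_ci.run_tests os="debian"
      on-success:
        - t5
    t5:
      action: chatops.post_message channel="ci" message="an OS test finished"
```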
Now: to tag the release, we want the tests on all three operating systems to pass.
That is what the “AND” join pattern does.
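In Mistral the AND join is expressed with `join: all`. A fragment, with a placeholder action name:

```yaml
    tag_release:
      # "join: all" waits for every incoming transition;
      # tag_release runs once, only after all three OS tests succeed
      join: all
      action: my_ci.tag_release
```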
If, on the other hand, any of the OS tests fails, we don’t wait for the rest before calling it a failure.
In this example, t5 also runs only once, but it does so on whichever upstream task completes first, and the workflow moves on.
This join is called a “discriminator”, because the US legal compliance people haven’t reviewed the workflow pattern language yet…
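In Mistral the discriminator is `join: one`. A fragment, with a placeholder action:

```yaml
    t5:
      # "join: one" is Mistral's discriminator: t5 fires on the first
      # upstream task to arrive, and the workflow moves on without
      # waiting for the rest
      join: one
      action: chatops.post_message channel="ci" message="first result is in"
```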
Finally, “multiple data”.
People ask: “can a workflow have loops?” My answer: “it can, but you don’t want it to.”
If all you need is the same action run over a set of data, use this pattern. In Mistral, the keyword for it is “with-items”.
Here, task 1 gets the list of available IP addresses from the inventory system, and task 2 uses them as input to the create-vm action.
Here is a cool thing about Mistral workflows: the actions run in parallel, AND you can control the concurrency.
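A sketch of the whole with-items example; the inventory and VM-creation actions are hypothetical placeholders:

```yaml
version: '2.0'

provision_vms:
  type: direct
  tasks:
    get_ips:
      # hypothetical action querying the inventory system
      action: my_pack.get_available_ips
      publish:
        ips: <% task(get_ips).result %>
      on-success:
        - create_vms
    create_vms:
      # runs the create_vm action once per ip, at most two at a time
      with-items: ip in <% $.ips %>
      concurrency: 2
      action: my_pack.create_vm ip=<% $.ip %>
```

The `concurrency` attribute is what keeps a large list from stampeding the downstream system.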
That’s it, that’s all you need.
This is the minimal set that gives enough power but keeps workflows simple to create, track, and reason about.