Just like anything in IT, automation is a tool. And any tool can be used incorrectly. In this talk, we discuss a few examples of automation (or lack thereof) gone wrong.
7. Source:
https://www.facebook.com/notes/facebook-engineering/more-details-on-todays-outage/431441338919/
“The key flaw that caused this outage to be so severe was
an unfortunate handling of an error condition. An
automated system for verifying configuration values
ended up causing much more damage than it fixed.
The intent of the automated system is to check for
configuration values that are invalid in the cache and
replace them with updated values from the persistent
store. This works well for a transient problem with the
cache, but it doesn’t work when the persistent store is
invalid.”
10. Source:
https://aws.amazon.com/message/41926/
The Amazon Simple Storage Service (S3) team was debugging an issue
causing the S3 billing system to progress more slowly than expected.
At 9:37AM PST, an authorized S3 team member using an established
playbook executed a command which was intended to remove a small
number of servers for one of the S3 subsystems that is used by the S3
billing process. Unfortunately, one of the inputs to the command was
entered incorrectly and a larger set of servers was removed than
intended.
11. Source:
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
Manual Steps
When Amazon's Availability Zone (AZ) started failing
we decided to get out of the zone all together. This
meant making significant changes to our AWS
configuration. While we have tools to change
individual aspects of our AWS deployment and
configuration they are not currently designed to enact
wholesale changes, such as moving sets of services out
of a zone completely. This meant that we had to
engage with each of the service teams to make the
manual (and potentially error prone) changes. In the
future we will be working to automate this process, so
it will scale for a company of our size and growth rate.
12. Why Do These Outages Happen, Despite Automation Being Used?
13. Why We Still Have Problems...
Our environments are becoming
increasingly complex:
1. Manual steps == human error
2. Microservices are popular, but even
simple LB/web/middleware/db setups
can have dozens of failure points
3. Failure to automate rollback/failover
14. Why We Still Have Problems...
A lot of older systems
exist, which have to be
interfaced with, and
generally don't provide a
lot of modern datacenter
protections.
Photo Credit:
https://www.flickr.com/photos/pargon/2444943158
17. What Else Can You Automate?
If it has a remote API, you can
automate it (with Ansible).
https://github.com/jimi-c/hue
18. Network Ops
In 2016, almost all major internet service
outages were caused by one of two problems:
1) DDoS attacks
2) BGP configuration mistakes
19. Build Safety Checks In By Default
1) How could we prevent the S3 outage?
2) How could we prevent accidentally running `rm -rf /`?

- name: set number of active servers
  ec2:
    image: ami-123456
    count: "{{ number_of_servers }}"
  when: number_of_servers > 10

- name: delete some path
  shell: rm -rf {{ some_path }}/
  when: some_path is defined and some_path != ""
20. Other Best Practices
1) Try to use built-in modules before resorting to shell/script
commands.
2) Prefix variable names, especially for something generic like "port",
and especially when using them with Ansible roles.
3) Keep it simple.

- name: delete some path
  file:
    path: "{{ some_path }}/"
    state: absent
In 2012, Gary Bernhardt gave a talk at CodeMash entitled "wat".
For those who may not have seen it before, the focus of this talk was this...
Basically he walks through a few examples of some programming language quirks. JavaScript (somewhat deservedly) gets most of the attention, as shown in this screen grab.
If you haven’t watched it, I highly recommend it because it’s a very funny and memorable talk.
So my coworker Greg DeKoenigsberg and I were kicking around ideas for me to use here at dotScale, and we thought it would be fun to riff off this in terms of automation, after which he coined the following term:
Basically, I’d like to discuss some times that admins and operations teams were using automation in, shall we say, less than optimal ways.
How many of you are using automation for some things, but not everything? And by everything I mean testing, CI/CD, failovers, scaling up, scaling down, backup recovery -
Everything (as Gary Oldman says here).
There has certainly been a big push to automate things thanks to the DevOps movement. But unfortunately, automation can still cause you problems if you’re not careful.
How?
First up, we have an example of automation reacting in unexpected ways. In 2010, Facebook had a pretty major outage, caused by a piece of automation software they wrote that was supposed to help fix things but actually made them much worse.
Why? Because it wasn’t designed to deal with a persistent error, only a transient one.
The result was that every host started hammering the database trying to handle the problem, resulting in a self-DDoS.
It’s always difficult to handle unforeseen corner-cases.
Next up, we have this Time-Warner outage from 2014, in which an incorrect network configuration was automatically distributed across their network devices, resulting in a major outage that impacted 29 states in the US and 11 million customers.
Just over a year ago, a user created a Stack Exchange post claiming he had accidentally wiped out his entire web server farm due to an Ansible playbook mistake.
It later came out that this was all a hoax, an attempt at a viral marketing campaign for his new business. What he essentially claimed was that by leaving the variables in the above Ansible snippet undefined, he had accidentally done a recursive removal of all files on every one of his servers.
A lot of people very quickly pointed out that that's not how Ansible works: undefined variables like this would raise an error.
However, if you DID accidentally write an Ansible task like this and initialized or defaulted those variables to empty strings, you might have a very bad day in front of you.
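To make that failure mode concrete, here is a hypothetical recreation of such a task (the variable name is illustrative, not taken from the actual post). With `some_path` defaulted to an empty string, a bare `is defined` check passes and the command expands to `rm -rf /`; the extra guards below reject that case:

```yaml
# Hypothetical recreation of the footgun: a default of "" passes a bare
# "is defined" check, so the command would expand to `rm -rf /`.
- name: delete some path
  shell: rm -rf {{ some_path }}/
  when:
    - some_path is defined
    - some_path | length > 0    # reject the empty-string default
    - some_path != "/"          # never operate on the filesystem root
```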
This is not out of the realm of possibility...
And then we have this, which I’m sure almost everyone here is familiar with.
This is an unfortunate real-world example of the previous slide. An engineer entered a parameter value that was larger than intended, which resulted in too many servers in that environment being removed. Unfortunately, the system had not been designed to tolerate that level of removal, so a lot of subsystems had to be restarted and resynced, which took a LONG time.
As always, computers do exactly what we tell them to do (at least usually).
Problems with automation aren’t just related to user error or unforeseen conditions.
Most frequently, they happen when you haven’t anticipated needing to automate certain aspects of your recovery and have to do something manually to recover.
Here, the Netflix team had to deal with a major AWS outage by manually moving services to another zone, for which they had no automated process. Luckily for them, they were able to do it without impacting customers, but not everyone is so lucky.
So why do these outages happen, despite automation being used?
Our environments are becoming increasingly complex.
1. Manual steps == human error
2. Microservices are popular, but even simple LB/web/middleware/db setups can have dozens of failure points
3. Failure to automate rollback/failover
2) A lot of older systems exist, which have to be interfaced with and generally don't provide a lot of modern datacenter protections.
On top of that, they're often quite expensive. At a previous company, we had a mainframe from a company whose nickname, in French, is "Le Grand Bleu".
The bill for the memory alone was close to $1 million. The storage used was approaching a petabyte.
That's not an easy monolithic system to fail over, even with automation. The long duration of maintenance tasks, like the S3 re-indexing needed when the service was restarted, illustrates this too.
So, what can we do?
Automate more.
While automation doesn’t automatically solve every problem for you, it does help stop you from repeating past mistakes.
Also, there are still a LOT of areas where automation is not used very much...
What else can you automate?
Well, if you’re unsure whether or not you can automate something, remember this:
If it has a remote API, you can automate it. I can’t really speak to other systems, but with Ansible, it’s really quite easy to write modules to manage things with APIs.
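As a minimal sketch of what "has a remote API" means in practice, the built-in `uri` module can drive any REST endpoint; the URL and payload below are hypothetical placeholders, not a real service:

```yaml
# Sketch only: the endpoint and body here are hypothetical placeholders.
- name: toggle a device through its remote REST API
  uri:
    url: "https://bridge.example.com/api/lights/1/state"
    method: PUT
    body_format: json
    body:
      "on": true    # quoted so YAML doesn't parse the key as a boolean
    status_code: 200
  delegate_to: localhost
```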
A talk I’ve given several times recently, including here in Paris in February, was all about controlling Philips Hue lights with Ansible. You can find the source code and playbooks for doing this at the link here, which I’ll leave up for a minute so anyone interested can take note of it.
Networking is one of the most important components of any datacenter, and yet if you look at a lot of the major outages in 2016, they were caused by one of two things:
1) DDoS attacks (which we can't do much about outside of things like CloudFlare, CloudFront, etc.).
2) BGP configuration mistakes.
For those who may not know, BGP is the routing protocol used to link major networks on the internet.
When mistakes are made, it very quickly causes major problems on the internet at large.
A little plug for Ansible here: this is by far the largest area we’ve seen Ansible expand into over the last year, even more so than Docker and containers. Why?
1) No agents. I've never met a network administrator who likes installing things on their gear, especially if it's third party software.
2) Vendor buy-in. ALL of the major vendors are contributing to Ansible to support their gear, because they know this is a pain point for many network admins.
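As an illustration (module and argument names vary by platform, and the AS numbers and address here are placeholders), pushing a BGP neighbor with one of the vendor-maintained modules looks roughly like this:

```yaml
# Illustrative sketch: platform modules differ; values are placeholders.
- name: ensure a BGP neighbor is configured
  ios_config:
    parents: router bgp 64500
    lines:
      - neighbor 192.0.2.1 remote-as 64512
```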
The next thing to remember is to always ALWAYS build safety checks into your automation. Never assume that your default variables will be safe, and always validate that they’re some sane value.
In Ansible, this is very easy to do with conditionals.
1) Prevent a problem like the S3 outage
2) Check paths
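One way to sketch this (variable names and thresholds are illustrative): fail fast with the `assert` module before any destructive task runs:

```yaml
# Sketch: validate inputs up front; the thresholds here are illustrative.
- name: sanity-check inputs before touching servers
  assert:
    that:
      - number_of_servers | int > 10
      - some_path is defined
      - some_path | length > 1
    msg: "Refusing to run: input values look unsafe"
```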
Finally, have some best practices around your automation.
1. Try to use built-in modules before resorting to shell commands and scripts. Modules by and large have a lot more safety built into them, so you can avoid a large class of mistakes by using them. Going back to the previous example, notice we don’t have a safety check here now. Why? Because the file module will not remove a directory by default if there are files in it. You could of course force this, in which case it would still be a good idea to use the conditionals to prevent accidental removals.
2. When creating variables, make sure you prefix them with something descriptive. For example, use “apache_http_port” instead of simply “port”. For Ansible, this will make the playbooks and roles you write MUCH safer.
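For example, a role's default variables might look like this (the names are illustrative):

```yaml
# Illustrative role defaults: the prefix makes collisions between roles
# (e.g. two roles both wanting a "port" variable) much less likely.
apache_http_port: 80
apache_https_port: 443
haproxy_stats_port: 8404
```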
3. Above all else, keep it simple. Ansible makes it very easy to keep things simple, but there are some ways you can introduce some very complex language. In general, we discourage this and try to steer users towards keeping playbooks as readable as possible. The same is true of other config management systems - the more complex you make it, the harder it is for others to maintain it.