Just like anything in IT, automation is a tool. And any tool can be used incorrectly. In this talk, we discuss a few examples of automation (or lack thereof) gone wrong.
7. Source:
https://www.facebook.com/notes/facebook-engineering/more-details-on-todays-outage/431441338919/
“The key flaw that caused this outage to be so severe was
an unfortunate handling of an error condition. An
automated system for verifying configuration values
ended up causing much more damage than it fixed.
The intent of the automated system is to check for
configuration values that are invalid in the cache and
replace them with updated values from the persistent
store. This works well for a transient problem with the
cache, but it doesn’t work when the persistent store is
invalid.”
10. Source:
https://aws.amazon.com/message/41926/
The Amazon Simple Storage Service (S3) team was debugging an issue
causing the S3 billing system to progress more slowly than expected.
At 9:37AM PST, an authorized S3 team member using an established
playbook executed a command which was intended to remove a small
number of servers for one of the S3 subsystems that is used by the S3
billing process. Unfortunately, one of the inputs to the command was
entered incorrectly and a larger set of servers was removed than
intended.
11. Source:
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
Manual Steps
When Amazon's Availability Zone (AZ) started failing
we decided to get out of the zone all together. This
meant making significant changes to our AWS
configuration. While we have tools to change
individual aspects of our AWS deployment and
configuration they are not currently designed to enact
wholesale changes, such as moving sets of services out
of a zone completely. This meant that we had to
engage with each of the service teams to make the
manual (and potentially error prone) changes. In the
future we will be working to automate this process, so
it will scale for a company of our size and growth rate.
12. Why Do These Outages Happen, Despite Automation Being Used?
13. Why We Still Have Problems...
Our environments are becoming
increasingly complex:
1. Manual steps == human error
2. Microservices are popular, but even
simple LB/web/middleware/db setups
can have dozens of failure points
3. Failure to automate rollback/failover
14. Why We Still Have Problems...
A lot of older systems
exist, which have to be
interfaced with, and
generally don't provide a
lot of modern datacenter
protections.
Photo Credit:
https://www.flickr.com/photos/pargon/2444943158
17. What Else Can You Automate?
If it has a remote API, you can
automate it (with Ansible).
https://github.com/jimi-c/hue
18. Network Ops
In 2016, almost all major internet service
outages were caused by one of two problems:
1) DDoS attacks
2) BGP configuration mistakes
19. Build Safety Checks In By Default
1) How could we prevent the S3 outage?
2) How could we prevent accidentally running `rm -rf /`?

- name: set number of active servers
  ec2:
    image: ami-123456
    count: "{{ number_of_servers }}"
  when: number_of_servers > 10

- name: delete some path
  shell: rm -rf {{ some_path }}/
  when: some_path is defined and some_path != ""
20. Other Best Practices
1) Try to use built-in modules before resorting to shell/script
commands.
2) Prefix variable names, especially for something generic like "port",
and especially when using them with Ansible roles.
3) Keep it simple.

- name: delete some path
  file:
    path: "{{ some_path }}/"
    state: absent
In 2012, Gary Bernhardt gave a talk at CodeMash entitled "wat".
For those who may not have seen it before, the focus of this talk was this...
Basically he walks through a few examples of some programming language quirks. JavaScript (somewhat deservedly) gets most of the attention, as shown in this screen grab.
If you haven’t watched it, I highly recommend it because it’s a very funny and memorable talk.
So my coworker Greg DeKoenigsberg and I were kicking around ideas for me to use here at dotScale, and we thought it would be fun to riff off this in terms of automation, after which he coined the following term:
Basically, I’d like to discuss some times that admins and operations teams were using automation in, shall we say, less than optimal ways.
How many of you are using automation for some things, but not everything? And by everything I mean testing, CI/CD, failovers, scaling up, scaling down, backup recovery -
Everything (as Gary Oldman says here).
There has certainly been a big push to automate things thanks to the DevOps movement. But unfortunately, automation can still cause you problems if you’re not careful.
How?
First up, we have an example of automation reacting in unexpected ways. In 2010, Facebook had a pretty major outage, caused by a piece of automation software they wrote that was supposed to help fix things but actually made them much worse.
Why? Because it wasn’t designed to deal with a persistent error, only a transient one.
The result was that every host started hammering the database trying to handle the problem, resulting in a self-DDoS.
It’s always difficult to handle unforeseen corner-cases.
Next up, we have this Time-Warner outage from 2014, in which an incorrect network configuration was automatically distributed across their network devices, resulting in a major outage that impacted 29 states in the US and 11 million customers.
Just over a year ago, a user created a Stack Exchange post claiming he had accidentally wiped out his entire web server farm due to an Ansible playbook mistake.
It later came out that this was all a hoax, an attempt at a viral marketing campaign for his new business. What he essentially claimed was that by leaving the variables in the above Ansible snippet undefined, he had accidentally done a recursive removal of all files on every one of his servers.
A lot of people very quickly pointed out that that's not how Ansible works: undefined variables like this would raise an error.
However, if you DID accidentally write an Ansible task like this and initialized or defaulted those variables to empty strings, you might have a very bad day in front of you.
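To make that failure mode concrete, here is a hypothetical recreation of such a task (the variable name is illustrative, not taken from the actual post). With `some_path` defaulted to an empty string, a bare `is defined` check passes and the command expands to `rm -rf /`; the extra guards below reject that case:

```yaml
# Hypothetical recreation of the footgun: a default of "" passes a bare
# "is defined" check, so the command would expand to `rm -rf /`.
- name: delete some path
  shell: rm -rf {{ some_path }}/
  when:
    - some_path is defined
    - some_path | length > 0    # reject the empty-string default
    - some_path != "/"          # never operate on the filesystem root
```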
This is not out of the realm of possibility...
And then we have this, which I’m sure almost everyone here is familiar with.
This is an unfortunate real-world example of the previous slide. An engineer entered a parameter value that was larger than intended, which resulted in too many servers in that environment being removed. Unfortunately, the system had not been designed to tolerate that level of removal, so a lot of subsystems had to be restarted and resynced, which took a LONG time.
As always, computers do exactly what we tell them to do (at least usually).
Problems with automation aren’t just related to user error or unforeseen conditions.
Most frequently, they happen when you haven’t anticipated needing to automate certain aspects of your recovery and have to do something manually to recover.
Here, the Netflix team had to deal with a major AWS outage by manually moving services to another zone, for which they had no automated process. Luckily for them, they were able to do it without impacting customers, but not everyone is so lucky.
So why do these outages happen, despite automation being used?
Our environments are becoming increasingly complex.
1. Manual steps == human error
2. Microservices are popular, but even simple LB/web/middleware/db setups can have dozens of failure points
3. Failure to automate rollback/failover
2) A lot of older systems exist, which have to be interfaced with and generally don't provide a lot of modern datacenter protections.
On top of that, they're often quite expensive. At a previous company, we had a mainframe from a company whose nickname, in French, is "Le Grand Bleu".
The bill for the memory alone was close to $1 million. The storage used was approaching a petabyte.
That's not an easy monolithic system to fail over, even with automation. The long duration of maintenance tasks, like the S3 re-indexing needed when the service was restarted, illustrates this too.
So, what can we do?
Automate more.
While automation doesn’t automatically solve every problem for you, it does help stop you from repeating past mistakes.
Also, there are still a LOT of areas where automation is not used very much...
What else can you automate?
Well, if you’re unsure whether or not you can automate something, remember this:
If it has a remote API, you can automate it. I can’t really speak to other systems, but with Ansible, it’s really quite easy to write modules to manage things with APIs.
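As a minimal sketch of what "has a remote API" means in practice, the built-in `uri` module can drive any REST endpoint; the URL and payload below are hypothetical placeholders, not a real service:

```yaml
# Sketch only: the endpoint and body here are hypothetical placeholders.
- name: toggle a device through its remote REST API
  uri:
    url: "https://bridge.example.com/api/lights/1/state"
    method: PUT
    body_format: json
    body:
      "on": true    # quoted so YAML doesn't parse the key as a boolean
    status_code: 200
  delegate_to: localhost
```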
A talk I’ve given several times recently, including here in Paris in February, was all about controlling Philips Hue lights with Ansible. You can find the source code and playbooks for doing this at the link here, which I’ll leave up for a minute so anyone interested can take note of it.
Networking is one of the most important components of any datacenter, and yet if you look at a lot of the major outages in 2016, they were caused by one of two things:
1) DDoS attacks (which we can't do much about outside of things like CloudFlare, CloudFront, etc.).
2) BGP configuration mistakes.
For those who may not know, BGP is the routing protocol used to link major networks on the internet.
When mistakes are made, it very quickly causes major problems on the internet at large.
A little plug for Ansible here: this is by far the largest area we’ve seen Ansible expand into over the last year, even more so than Docker and containers. Why?
1) No agents. I've never met a network administrator who likes installing things on their gear, especially if it's third party software.
2) Vendor buy-in. ALL of the major vendors are contributing to Ansible to support their gear, because they know this is a pain point for many network admins.
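As an illustration (module and argument names vary by platform, and the AS numbers and address here are placeholders), pushing a BGP neighbor with one of the vendor-maintained modules looks roughly like this:

```yaml
# Illustrative sketch: platform modules differ; values are placeholders.
- name: ensure a BGP neighbor is configured
  ios_config:
    parents: router bgp 64500
    lines:
      - neighbor 192.0.2.1 remote-as 64512
```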
The next thing to remember is to always ALWAYS build safety checks into your automation. Never assume that your default variables will be safe, and always validate that they’re some sane value.
In Ansible, this is very easy to do with conditionals.
1) Prevent a problem like the S3 outage
2) Check paths
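One way to sketch this (variable names and thresholds are illustrative): fail fast with the `assert` module before any destructive task runs:

```yaml
# Sketch: validate inputs up front; the thresholds here are illustrative.
- name: sanity-check inputs before touching servers
  assert:
    that:
      - number_of_servers | int > 10
      - some_path is defined
      - some_path | length > 1
    msg: "Refusing to run: input values look unsafe"
```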
Finally, have some best practices around your automation.
1. Try to use built-in modules before resorting to shell commands and scripts. Modules by and large have a lot more safety built into them, so you can avoid a large class of mistakes by using them. Going back to the previous example, notice we don’t have a safety check here now. Why? Because the file module will not remove a directory by default if there are files in it. You could of course force this, in which case it would still be a good idea to use the conditionals to prevent accidental removals.
2. When creating variables, make sure you prefix them with something descriptive. For example, use “apache_http_port” instead of simply “port”. For Ansible, this will make the playbooks and roles you write MUCH safer.
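For example, a role's default variables might look like this (the names are illustrative):

```yaml
# Illustrative role defaults: the prefix makes collisions between roles
# (e.g. two roles both wanting a "port" variable) much less likely.
apache_http_port: 80
apache_https_port: 443
haproxy_stats_port: 8404
```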
3. Above all else, keep it simple. Ansible makes it very easy to keep things simple, but there are some ways you can introduce some very complex language. In general, we discourage this and try to steer users towards keeping playbooks as readable as possible. The same is true of other config management systems - the more complex you make it, the harder it is for others to maintain it.