James Cammarata
dotScale - 2017
Source:
https://twitter.com/garybernhardt
Source:
https://www.destroyallsoftware.com/talks/wat
Watomation
Source:
Image from “The Professional (Leon)”
Comment? (How?)
Source:
https://www.facebook.com/notes/facebook-engineering/more-details-on-todays-outage/431441338919/
“The key flaw that caused this outage to be so severe was
an unfortunate handling of an error condition. An
automated system for verifying configuration values
ended up causing much more damage than it fixed.
The intent of the automated system is to check for
configuration values that are invalid in the cache and
replace them with updated values from the persistent
store. This works well for a transient problem with the
cache, but it doesn’t work when the persistent store is
invalid.”
Source:
http://www.tomsitpro.com/articles/time-warner-cable-outage-internet-investigation,1-2160.html
"During an overnight network maintenance activity in
which we were managing IP addresses, an erroneous
configuration was propagated throughout our national
backbone, resulting in a network outage," according to
Time Warner.
“While exact details are sparse, the outage did occur
during maintenance activity, indicating possible human
error.”
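One way to limit the damage of an erroneous configuration is to validate it before it is installed and to roll it out in small batches. A minimal sketch, using an sshd configuration as a stand-in for a real network configuration; the host group, template name, and batch size are hypothetical and not from the talk:

```yaml
- hosts: edge_hosts              # hypothetical inventory group
  serial: "10%"                  # change only a tenth of the hosts per batch
  max_fail_percentage: 0         # stop the run if any host in a batch fails
  become: true
  tasks:
    - name: install the sshd config only if it passes a syntax check
      template:
        src: sshd_config.j2      # hypothetical template
        dest: /etc/ssh/sshd_config
        validate: /usr/sbin/sshd -t -f %s
      notify: restart sshd
  handlers:
    - name: restart sshd
      service:
        name: sshd
        state: restarted
```

With this layout a bad template fails validation on the first batch, and the remaining hosts are never touched.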
Sources:
https://goo.gl/Akdp54 (http://www.independent.co.uk)
http://www.itpro.co.uk/networking/26363/man-who-deleted-company-with-one-line-of-code-admits-it-was-all-a-hoax
- shell: rm -rf {{path}}/{{some_file}}
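A task like the one above can be guarded so that empty values stop the play instead of expanding to a much broader path. A minimal sketch, reusing the variable names path and some_file from the slide; the error message text is an assumption:

```yaml
- name: refuse to continue if either variable is missing or empty
  assert:
    that:
      - path is defined
      - some_file is defined
      - path | length > 0
      - some_file | length > 0
    msg: "path and some_file must both be set to non-empty values"

# Runs only if the assertion above passed.
- name: remove the target without going through the shell
  file:
    path: "{{ path }}/{{ some_file }}"
    state: absent
```

The assertion fails the play on the affected host before any removal is attempted.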
Source:
https://aws.amazon.com/message/41926/
The Amazon Simple Storage Service (S3) team was debugging an issue
causing the S3 billing system to progress more slowly than expected.
At 9:37AM PST, an authorized S3 team member using an established
playbook executed a command which was intended to remove a small
number of servers for one of the S3 subsystems that is used by the S3
billing process. Unfortunately, one of the inputs to the command was
entered incorrectly and a larger set of servers was removed than
intended.
Source:
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
Manual Steps
When Amazon's Availability Zone (AZ) started failing
we decided to get out of the zone all together. This
meant making significant changes to our AWS
configuration. While we have tools to change
individual aspects of our AWS deployment and
configuration they are not currently designed to enact
wholesale changes, such as moving sets of services out
of a zone completely. This meant that we had to
engage with each of the service teams to make the
manual (and potentially error prone) changes. In the
future we will be working to automate this process, so
it will scale for a company of our size and growth rate.
Why Do These Outages Happen, Despite Automation Being Used?
Why We Still Have Problems...
Our environments are becoming
increasingly complex:
1. Manual steps == human error
2. Microservices are popular, but even
simple LB/web/middleware/db setups
can have dozens of failure points
3. Failure to automate rollback/failover
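Rollback can be automated in the same playbook that performs the change. A minimal sketch using Ansible's block/rescue, assuming a symlink-based release layout, a service named app, and a local health endpoint, all of which are hypothetical:

```yaml
- hosts: webservers                       # hypothetical inventory group
  serial: 1                               # one host at a time
  become: true
  vars:
    app_current_link: /srv/app/current              # hypothetical paths
    app_new_release: /srv/app/releases/new
    app_previous_release: /srv/app/releases/previous
  tasks:
    - block:
        - name: point the service at the new release
          file:
            src: "{{ app_new_release }}"
            dest: "{{ app_current_link }}"
            state: link

        - name: restart the service
          service:
            name: app
            state: restarted

        - name: wait for the health check to answer 200
          uri:
            url: http://localhost:8080/health   # hypothetical endpoint
            status_code: 200
          register: health
          retries: 5
          delay: 3
          until: health is succeeded

      rescue:
        - name: point the service back at the previous release
          file:
            src: "{{ app_previous_release }}"
            dest: "{{ app_current_link }}"
            state: link

        - name: restart the service on the previous release
          service:
            name: app
            state: restarted

        - name: mark the host as failed so the run stops here
          fail:
            msg: "deploy failed on {{ inventory_hostname }} and was rolled back"
```

If any task in the block fails, the rescue section restores the previous release and then fails the host explicitly, so the remaining hosts are not updated.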
Why We Still Have Problems...
A lot of older systems
exist, which have to be
interfaced with, and
generally don't provide a
lot of modern datacenter
protections.
Photo Credit:
https://www.flickr.com/photos/pargon/2444943158
So What Can We Do?
What Else Can You Automate?
If it has a remote API, you can
automate it (with Ansible).
https://github.com/jimi-c/hue
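For an API that has no dedicated module, the generic uri module is often enough to get started. A minimal sketch, assuming a Hue-style bridge that accepts a JSON PUT to switch a light on; the bridge address, API user, light ID, and endpoint shape are placeholders and assumptions, not taken from the repository above:

```yaml
- hosts: localhost
  gather_facts: false
  vars:
    hue_bridge: 192.0.2.10          # hypothetical bridge address
    hue_user: example-api-user      # hypothetical API user
  tasks:
    - name: turn light 1 on through the bridge's HTTP API
      uri:
        url: "http://{{ hue_bridge }}/api/{{ hue_user }}/lights/1/state"
        method: PUT
        body_format: json
        body:
          "on": true                # quoted so YAML does not read the key as a boolean
        status_code: 200
```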
Network Ops
In 2016, almost all major internet service
outages were caused by one of two problems:
1) DDoS attacks
2) BGP configuration mistakes
Build Safety Checks In By Default
1) How could we prevent the S3 outage?
2) How could we prevent accidentally running `rm -rf /`?
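# Only change the server count when the requested value stays above 10.
- name: set number of active servers
  ec2:
    image: ami-123456
    count: "{{ number_of_servers }}"
  when: number_of_servers | int > 10
# Only delete when the path variable is set and non-empty.
- name: delete some path
  shell: rm -rf {{ some_path }}/
  when: some_path is defined and some_path != ""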
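The same idea can be made stricter by failing loudly instead of silently skipping the task. A minimal sketch, assuming the current server count is available in a variable; current_servers, min_servers, and max_removed_per_run are hypothetical names and values introduced here for illustration:

```yaml
- name: refuse changes that remove too many servers at once
  assert:
    that:
      - "(number_of_servers | int) >= (min_servers | int)"
      - "(current_servers | int) - (number_of_servers | int) <= (max_removed_per_run | int)"
    msg: "requested server count is outside the allowed range"
  vars:
    min_servers: 10             # hypothetical floor for the environment
    max_removed_per_run: 2      # hypothetical limit on a single change
```

Placed before the task that changes the server count, a mistyped input stops the play with an error rather than shrinking the environment.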
Other Best Practices
1) Try to use built-in modules before resorting to shell/script
commands.
2) Prefix variable names, particularly generic ones like “port” and
particularly when using them with Ansible roles.
3) Keep it simple.
- name: delete some path
  file:
    path: "{{ some_path }}/"
    state: absent
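The variable-prefix advice can be illustrated with two services that each need a port. A minimal sketch; the host group, port values, and configuration file path are hypothetical:

```yaml
- hosts: webservers                        # hypothetical inventory group
  become: true
  vars:
    apache_http_port: 8080                 # prefixed: clearly belongs to Apache
    haproxy_stats_port: 9000               # a second port that cannot collide with it
  tasks:
    - name: set the port Apache listens on
      lineinfile:
        path: /etc/httpd/conf/httpd.conf   # hypothetical config location
        regexp: '^Listen '
        line: "Listen {{ apache_http_port }}"
```

With a bare name such as port, whichever definition takes precedence would silently apply to both services.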
Merci
Beaucoup!

dotScale 2017 - watomation

Editor's Notes

  1. Final
  2. In 2012, Gary Bernhardt gave a talk at CodeMash entitled "wat". For those who may not have seen it before, the focus of this talk was this...
  3. Basically, he walks through a few examples of programming language quirks. JavaScript (somewhat deservedly) gets most of the attention, as shown in this screen grab. If you haven’t watched it, I highly recommend it because it’s a very funny and memorable talk. So my coworker Greg DeKoenigsberg and I were kicking around ideas for me to use here at dotScale, and we thought it would be fun to riff off this in terms of automation, after which he coined the following term:
  4. Basically, I’d like to discuss some times that admins and operations teams were using automation in, shall we say, less than optimal ways. How many of you are using automation for some things, but not everything? And by everything I mean testing, CI/CD, failovers, scaling up, scaling down, backup recovery -
  5. Everything (as Gary Oldman says here). There has certainly been a big push to automate things thanks to the DevOps movement. But unfortunately, automation can still cause you problems if you’re not careful.
  6. How?
  7. First up, we have an example of automation reacting in unexpected ways. In 2010, Facebook had a pretty major outage, caused by a piece of automation software they wrote that was supposed to help fix things, but really just made things much worse. Why? Because it wasn’t designed to deal with a persistent error, only a transient one. The result was that every host started hammering the database trying to handle the problem, resulting in a self-DDoS. It’s always difficult to handle unforeseen corner cases.
  8. Next up, we have this Time Warner outage from 2014, in which an incorrect network configuration was automatically distributed across their network devices, resulting in a major outage that impacted 29 states in the US and 11 million customers.
  9. Just over a year ago, a guy created a Stack Exchange post in which he claimed he accidentally wiped out his entire web server farm due to an Ansible playbook mistake. It later came out that this was all a hoax, in which he was attempting to create a viral marketing campaign for his new business, but what he essentially claimed was that by leaving the variables in the above Ansible snippet undefined, he accidentally did a recursive removal of all files on every one of his servers. A lot of people very quickly pointed out that that’s not how Ansible works: undefined variables like this would raise an error. However, if you DID accidentally write an Ansible task like this and initialized or defaulted those variables to empty strings, you might have a very bad day in front of you. This is not out of the realm of possibility...
  10. And then we have this, which I’m sure almost everyone here is familiar with. This is an unfortunate real-world example of the kind of mistake shown on the previous slide. An engineer set a variable to a value that was bigger than expected, which resulted in too many servers in that environment getting removed. Unfortunately, things had not been designed to tolerate that level of removal, so a lot of things had to be restarted and resynced, which took a LONG time. As always, computers do exactly what we tell them to do (at least usually).
  11. Problems with automation aren’t just related to user error or unforeseen conditions. Most frequently, they happen when you haven’t anticipated needing to automate certain aspects of your recovery and have to do something manually to recover. Here, the Netflix team had to deal with a major AWS outage by manually moving services to another zone, for which they had no automated process. Luckily for them, they were able to do it without impacting customers, but not everyone is so lucky.
  12. So why do these outages happen, despite automation being used?
  13. Our environments are becoming increasingly complex: 1) Manual steps == human error. 2) Microservices are popular, but even simple LB/web/middleware/db setups can have dozens of failure points. 3) Failure to automate rollback/failover.
  14. A lot of older systems exist, which have to be interfaced with and generally don't provide a lot of modern datacenter protections. On top of that, they’re often quite expensive. In a previous company, we had a mainframe from a company whose nickname, en français, est “Le Grand Bleu”. The bill for the memory alone was close to $1 million. The storage used was getting close to a petabyte. That's not an easy monolithic system to fail over, even with automation. This is also illustrated by long times to do maintenance tasks, like the S3 re-indexing needed when the service was restarted.
  15. So, what can we do?
  16. Automate more. While automation doesn’t automatically solve every problem for you, it does help stop you from repeating past mistakes. Also, there are still a LOT of areas where automation is not used very much...
  17. What else can you automate? Well, if you’re unsure whether or not you can automate something, remember this: If it has a remote API, you can automate it. I can’t really speak to other systems, but with Ansible, it’s really quite easy to write modules to manage things with APIs. A talk I’ve given several times recently, including here in Paris in February, was all about controlling Philips Hue lights with Ansible. You can find the source code and playbooks for doing this at the link here, which I’ll leave up for a minute so anyone interested can take note of it.
  18. Networking is one of the most important components of any datacenter, and yet if you look at a lot of the major outages in 2016, they were caused by one of two things: 1) DDoS attacks (which we can't do much about outside of things like CloudFlare, CloudFront, etc.). 2) BGP configuration mistakes. For those who may not know, BGP is the routing protocol used to link major networks on the internet. When mistakes are made, it very quickly causes major problems on the internet at large. A little plug for Ansible here: This is by far the largest area we’ve seen Ansible expand into over the last year, even more so than Docker and containers. Why? 1) No agents. I've never met a network administrator who likes installing things on their gear, especially if it's third-party software. 2) Vendor buy-in. ALL of the major vendors are contributing to Ansible to support their gear, because they know this is a pain point for many network admins.
  19. The next thing to remember is to always ALWAYS build safety checks into your automation. Never assume that your default variables will be safe, and always validate that they’re some sane value. In Ansible, this is very easy to do with conditionals. 1) Prevent a problem like the S3 outage. 2) Check paths.
  20. Finally, have some best practices around your automation. 1. Try to use built-in modules before resorting to shell commands and scripts. Modules by and large have a lot more safety built into them, so you can avoid a large class of mistakes by using them. Going back to the previous example, notice that the task on this slide no longer goes through the shell, so there is no command line to expand in surprising ways. Be careful, though: with state: absent, the file module deletes a directory recursively, contents included, so it is a good idea to still use the conditionals to prevent accidental removals. 2. When creating variables, make sure you prefix them with something descriptive. For example, use “apache_http_port” instead of simply “port”. For Ansible, this will make the playbooks and roles you write MUCH safer. 3. Above all else, keep it simple. Ansible makes it very easy to keep things simple, but there are some ways you can introduce some very complex language. In general, we discourage this and try to steer users towards keeping playbooks as readable as possible. The same is true of other config management systems - the more complex you make it, the harder it is for others to maintain it.
  21. And remember, only you can prevent watomation!