James Cammarata
dotScale - 2017
Source:
https://twitter.com/garybernhardt
Source:
https://www.destroyallsoftware.com/talks/wat
Watomation
Source:
Image from “The Professional (Leon)”
Comment? ("How?")
Source:
https://www.facebook.com/notes/facebook-engineering/more-details-on-todays-outage/431441338919/
“The key flaw that caused this outage to be so severe was
an unfortunate handling of an error condition. An
automated system for verifying configuration values
ended up causing much more damage than it fixed.
The intent of the automated system is to check for
configuration values that are invalid in the cache and
replace them with updated values from the persistent
store. This works well for a transient problem with the
cache, but it doesn’t work when the persistent store is
invalid.”
Source:
http://www.tomsitpro.com/articles/time-warner-cable-outage-internet-investigation,1-2160.html
"During an overnight network maintenance activity in
which we were managing IP addresses, an erroneous
configuration was propagated throughout our national
backbone, resulting in a network outage," according to
Time Warner.
“While exact details are sparse, the outage did occur
during maintenance activity, indicating possible human
error.”
Sources:
https://goo.gl/Akdp54 (http://www.independent.co.uk)
http://www.itpro.co.uk/networking/26363/man-who-deleted-company-with-one-line-of-code-admits-it-was-all-a-hoax
- shell: rm -rf {{path}}/{{some_file}}
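A minimal sketch of why a task like this is dangerous when its variables are defined but empty. The play and the empty `vars` values are hypothetical, and the `debug` task only prints the command that would be rendered; it deletes nothing.

```yaml
- hosts: localhost
  gather_facts: false
  vars:
    # Hypothetical values: both variables are defined, but empty.
    path: ""
    some_file: ""
  tasks:
    - name: show what the shell task would run (prints only, deletes nothing)
      debug:
        msg: "rm -rf {{ path }}/{{ some_file }}"
```

With both variables empty, the rendered message is `rm -rf /`. If the variables were left undefined instead, Ansible would by default stop with an undefined-variable error rather than render the command.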
Source:
https://aws.amazon.com/message/41926/
The Amazon Simple Storage Service (S3) team was debugging an issue
causing the S3 billing system to progress more slowly than expected.
At 9:37AM PST, an authorized S3 team member using an established
playbook executed a command which was intended to remove a small
number of servers for one of the S3 subsystems that is used by the S3
billing process. Unfortunately, one of the inputs to the command was
entered incorrectly and a larger set of servers was removed than
intended.
Source:
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
Manual Steps
When Amazon's Availability Zone (AZ) started failing
we decided to get out of the zone all together. This
meant making significant changes to our AWS
configuration. While we have tools to change
individual aspects of our AWS deployment and
configuration they are not currently designed to enact
wholesale changes, such as moving sets of services out
of a zone completely. This meant that we had to
engage with each of the service teams to make the
manual (and potentially error prone) changes. In the
future we will be working to automate this process, so
it will scale for a company of our size and growth rate.
Why Do These Outages Happen, Despite Automation Being Used?
Why We Still Have Problems...
Our environments are becoming
increasingly complex:
1. Manual steps == human error
2. Microservices are popular, but even
simple LB/web/middleware/db setups
can have dozens of failure points
3. Failure to automate rollback/failover
Why We Still Have Problems...
A lot of older systems
exist, which have to be
interfaced with, and
generally don't provide a
lot of modern datacenter
protections.
Photo Credit:
https://www.flickr.com/photos/pargon/2444943158
So What Can We Do?
What Else Can You Automate?
If it has a remote API, you can
automate it (with Ansible).
https://github.com/jimi-c/hue
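As a rough illustration of the "remote API" point, the sketch below drives an HTTP API with Ansible's built-in `uri` module. It is not taken from the repository linked above: the bridge address, API user, light ID, and the URL layout (modeled on a Hue-style local REST endpoint) are all assumptions for illustration.

```yaml
- hosts: localhost
  gather_facts: false
  vars:
    # Hypothetical values; replace with your own bridge address and API user.
    hue_bridge_host: 192.0.2.10
    hue_api_user: example-user
    hue_light_id: 1
  tasks:
    - name: turn a light on through the bridge's HTTP API
      uri:
        url: "http://{{ hue_bridge_host }}/api/{{ hue_api_user }}/lights/{{ hue_light_id }}/state"
        method: PUT
        body_format: json
        body:
          # Quoted so YAML does not read the key "on" as a boolean.
          "on": true
        status_code: 200
```

The same pattern (a URL, a method, a JSON body, and an expected status code) applies to most devices and services that expose an HTTP API.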
Network Ops
In 2016, almost all major internet service
outages were caused by one of two problems:
1) DDoS attacks
2) BGP configuration mistakes
Build Safety Checks In By Default
1) How could we prevent the S3 outage?
2) How could we prevent accidentally running `rm -rf /`?
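# Only change the server count when the requested number stays above a floor.
- name: set number of active servers
  ec2:
    image: ami-123456
    count: "{{ number_of_servers }}"
  when: number_of_servers | int > 10
# Only delete when the path variable is defined and not empty.
- name: delete some path
  shell: rm -rf {{ some_path }}/
  when: some_path is defined and some_path != ""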
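The two guarded tasks above skip silently when a check fails. A complementary approach, sketched here, is to fail loudly before anything destructive runs by using Ansible's built-in `assert` module. The `min_servers` variable and the specific checks are assumptions for illustration, not part of the original slides.

```yaml
- name: refuse to continue if the inputs look unsafe
  assert:
    that:
      - number_of_servers is defined
      # min_servers is a hypothetical floor defined elsewhere in your vars.
      - number_of_servers | int >= min_servers | int
      - some_path is defined
      - some_path | length > 0
      - some_path != "/"
    msg: "Refusing to run: number_of_servers or some_path failed validation"
```

Placed at the top of a play, a task like this stops the run with a clear message instead of letting later tasks act on a bad value.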
Other Best Practices
1) Try to use built-in modules before resorting to shell/script
commands.
2) Prefix variable names, especially generic ones like “port”, and
especially when using them with Ansible roles.
3) Keep it simple.
- name: delete some path
  file:
    path: "{{ some_path }}/"
    state: absent
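To illustrate the variable-prefix advice, here is a small sketch. The `haproxy_http_port` name, the port values, and the play itself are hypothetical.

```yaml
- hosts: localhost
  gather_facts: false
  vars:
    # Prefixed names say which service each value belongs to ...
    apache_http_port: 80
    haproxy_http_port: 8080
    # ... whereas a bare "port" could be overridden by any role or
    # inventory file that happens to define the same generic name.
  tasks:
    - name: show the ports in use
      debug:
        msg: "apache={{ apache_http_port }} haproxy={{ haproxy_http_port }}"
```

Because Ansible variables share one namespace per host, a descriptive prefix is the simplest way to keep two roles from colliding on the same name.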
Merci
Beaucoup!

dotScale 2017 - watomation

Editor's Notes

  1. Final
  2. In 2012, Gary Bernhardt gave a talk at CodeMash entitled "wat". For those who may not have seen it before, the focus of this talk was this...
  3. Basically, he walks through a few examples of programming language quirks. JavaScript (somewhat deservedly) gets most of the attention, as shown in this screen grab. If you haven’t watched it, I highly recommend it because it’s a very funny and memorable talk. So my coworker Greg DeKoenigsberg and I were kicking around ideas for me to use here at dotScale, and we thought it would be fun to riff off this in terms of automation, after which he coined the following term:
  4. Basically, I’d like to discuss some times that admins and operations teams were using automation in, shall we say, less than optimal ways. How many of you are using automation for some things, but not everything? And by everything I mean testing, CI/CD, failovers, scaling up, scaling down, backup recovery -
  5. Everything (as Gary Oldman says here). There has certainly been a big push to automate things thanks to the DevOps movement. But unfortunately, automation can still cause you problems if you’re not careful.
  6. How?
  7. First up, we have an example of automation reacting in unexpected ways. In 2010, Facebook had a pretty major outage, caused by a piece of automation software they wrote that was supposed to help fix things, but really just made things much worse. Why? Because it wasn’t designed to deal with a persistent error, only a transient one. The result was that every host started hammering the database trying to handle the problem, resulting in a self-DDoS. It’s always difficult to handle unforeseen corner cases.
  8. Next up, we have this Time-Warner outage from 2014, in which an incorrect network configuration was automatically distributed across their network devices, resulting in a major outage that impacted 29 states in the US and 11 million customers.
  9. Just over a year ago, a guy created a Stack Exchange post in which he claimed he accidentally wiped out his entire web server farm due to an Ansible playbook mistake. It later came out that this was all a hoax, in which he was attempting to create a viral marketing campaign for his new business, but what he essentially claimed was that by leaving the variables in the above Ansible snippet undefined, he accidentally did a recursive removal of all files on every one of his servers. A lot of people very quickly pointed out that that’s not how Ansible works: undefined variables like this would raise an error. However, if you DID accidentally write an Ansible task like this and initialized or defaulted those variables to empty strings, you might have a very bad day in front of you. This is not out of the realm of possibility...
  10. And then we have this, which I’m sure almost everyone here is familiar with. This is an unfortunate real-world example of the kind of incident described on the previous slide. An engineer set a variable to a value that was bigger than expected, which resulted in too many servers in that environment getting removed. Unfortunately, things had not been designed to tolerate that level of removal, so a lot of things had to be restarted and resynced, which took a LONG time. As always, computers do exactly what we tell them to do (at least usually).
  11. Problems with automation aren’t just related to user error or unforeseen conditions. Most frequently, they happen when you haven’t anticipated needing to automate certain aspects of your recovery and have to do something manually to recover. Here, the Netflix team had to deal with a major AWS outage by manually moving services to another zone, for which they had no automated process. Luckily for them, they were able to do it without impacting customers, but not everyone is so lucky.
  12. So why do these outages happen, despite automation being used?
  13. Our environments are becoming increasingly complex. 1. Manual steps == human error. 2. Microservices are popular, but even simple LB/web/middleware/db setups can have dozens of failure points. 3. Failure to automate rollback/failover.
  14. 2) A lot of older systems exist, which have to be interfaced with and generally don't provide a lot of modern datacenter protections. On top of that, they’re often quite expensive. In a previous company, we had a mainframe from a company whose nickname, en français, est “Le Grand Bleu”. The bill for the memory alone was close to $1 million. The storage used was getting close to a petabyte. That's not an easy monolithic system to fail over, even with automation. This is also illustrated by long times to do maintenance tasks, like the S3 re-indexing needed when the service was restarted.
  15. So, what can we do?
  16. Automate more. While automation doesn’t automatically solve every problem for you, it does help stop you from repeating past mistakes. Also, there are still a LOT of areas where automation is not used very much...
  17. What else can you automate? Well, if you’re unsure whether or not you can automate something, remember this: If it has a remote API, you can automate it. I can’t really speak to other systems, but with Ansible, it’s really quite easy to write modules to manage things with APIs. A talk I’ve given several times recently, including here in Paris in February, was all about controlling Philips Hue lights with Ansible. You can find the source code and playbooks for doing this at the link here, which I’ll leave up for a minute so anyone interested can take note of it.
  18. Networking is one of the most important components of any datacenter, and yet if you look at a lot of the major outages in 2016 they were caused by one of two things: 1) DDoS attacks (which we can't do much about outside of things like CloudFlare, CloudFront, etc.). 2) BGP configuration mistakes. For those who may not know, BGP is the routing protocol used to link major networks on the internet. When mistakes are made, it very quickly causes major problems on the internet at large. A little plug for Ansible here: This is by far the largest area we’ve seen Ansible expand into over the last year, even more so than Docker and containers. Why? 1) No agents. I've never met a network administrator who likes installing things on their gear, especially if it's third-party software. 2) Vendor buy-in. ALL of the major vendors are contributing to Ansible to support their gear, because they know this is a pain point for many network admins.
  19. The next thing to remember is to always ALWAYS build safety checks into your automation. Never assume that your default variables will be safe, and always validate that they’re some sane value. In Ansible, this is very easy to do with conditionals. 1) Prevent a problem like the S3 outage. 2) Check paths.
  20. Finally, have some best practices around your automation. 1. Try to use built-in modules before resorting to shell commands and scripts. Modules by and large have a lot more safety built into them, so you can avoid a large class of mistakes by using them. Going back to the previous example, notice the shell command is gone. One caveat: with state: absent, the file module deletes a directory recursively, contents included, so it is still a good idea to use the conditionals to prevent accidental removals. 2. When creating variables, make sure you prefix them with something descriptive. For example, use “apache_http_port” instead of simply “port”. For Ansible, this will make the playbooks and roles you write MUCH safer. 3. Above all else, keep it simple. Ansible makes it very easy to keep things simple, but there are some ways you can introduce some very complex language. In general, we discourage this and try to steer users towards keeping playbooks as readable as possible. The same is true of other config management systems - the more complex you make it, the harder it is for others to maintain it.
  21. And remember, only you can prevent watomation!