SlideShare a Scribd company logo
AWS Outage / Availability zone failure
in
Sydney region
- 05th June 2016 -
Author: Gilles Baillet
* Disclaimer: The opinions expressed in this presentation are the author's own and do not reflect the view of his employer
Who am I?
Gilles Baillet
Cloud Centre of Excellence Manager
Leading a team of 5 DevOps engineers on the Ops (dark) side of DevOps
AWS Certified SysOps Associate and Solutions Architect Associate
Fan of DevOps, AWS, Data Pipeline and now Lambda
Food, drinks travel-addict and almost married!
You can meet me at several meetups around Sydney (AWS, Docker, Elastic)
You can connect with me on LinkedIn: https://au.linkedin.com/in/gillesbaillet I accept connections from (almost)
everyone!
Before we start
Availability Zone Alignment
Randomisation of the assignment of AZs across AWS accounts
Our AZ are “aligned” across all our production and non-production accounts
Tip: Talk to your TAM!
The chain of events as presented by AWS
• At 3:25PM AEST: loss of power at a regional substation
• At 4:46PM AEST: power restored
• At 6:00PM AEST: over 80% of impacted services back online
• At 1:00AM AEST: nearly all instances recovered
• TOTAL DURATION: 1h21 / 9h35
http://aws.amazon.com/message/4372T8/
The chain of events as experienced by my company
• At 3:25PM AEST: trigger of monitoring/alerting services
• At 3:30PM AEST: conference bridge opened
• At 5:30PM AEST: most services were restored
• At 3:00AM AEST: all production services were restored
• TOTAL DURATION: 2h05 / 11h35
Black Swan
“An event that comes as a surprise, has a major effect, and is often inappropriately rationalized after the fact
with the benefit of hindsight. The term is based on an ancient saying which presumed black swans did not
exist, but the saying was rewritten after black swans were discovered in the wild”
https://en.wikipedia.org/wiki/Black_swan_theory
Taleb, N. N. (2007). The black swan: The impact of the highly improbable. Random house.
Impact during the outage
• all services running in the impacted AZ
• some Auto Scaling Group processes
• a NIC failure at 3:26PM
Instance restarted
No ELB health checks
Healthy instance marked as unhealthy
• EC2 Console / EC2 CLI commands
• Some CloudWatch metrics
• Some services relying on a single instance of a service (eg. domain controller)
Impact after the outage
• DB repair / integrity check
• Restoration of data stored on ephemeral storage
• 24 hours fixing instances in lower environments (DEV, UAT etc.)
• Clean up of rogue instances
Some things did work
• ELB Health checks
• RDS Database failover
• Some Auto Scaling Group processes
• AWS support escalation
• All critical services running on Cloud 2.0!
Lesson learned
• Implementation vs design
• Instance type matters
• AWS Enterprise support is worth the cost
• Cattle are awesome
• Datacenters in Sydney are not weather proof
• 100s of companies impacted
What’s next?
• Review of design documents vs implementation
• Use older instance types
• Use Chaos Monkey
• Turn Pets into Cattle (more work for my team!)
• Deploy new VPCs across 3 AZs
• Revisit DNS client TTL versus Health Check timeout
• AWS to fix “things” on their end
Questions?

More Related Content

What's hot

Taking Gliffy to the Cloud – Moving to Atlassian Connect - Mike Cialowicz
Taking Gliffy to the Cloud – Moving to Atlassian Connect - Mike CialowiczTaking Gliffy to the Cloud – Moving to Atlassian Connect - Mike Cialowicz
Taking Gliffy to the Cloud – Moving to Atlassian Connect - Mike Cialowicz
Atlassian
 
(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
Amazon Web Services
 
Infrastructure Automation on AWS using a Real-World Customer Example
Infrastructure Automation on AWS using a Real-World Customer ExampleInfrastructure Automation on AWS using a Real-World Customer Example
Infrastructure Automation on AWS using a Real-World Customer Example
API Talent
 
Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)
Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)
Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)
Cohesive Networks
 
Lessons Learned Running The Largest OpenStack Clouds
Lessons Learned Running The Largest OpenStack CloudsLessons Learned Running The Largest OpenStack Clouds
Lessons Learned Running The Largest OpenStack Clouds
Kenneth Hui
 
Chicago AWS user group meetup - May 2014 at Cohesive
Chicago AWS user group meetup - May 2014 at CohesiveChicago AWS user group meetup - May 2014 at Cohesive
Chicago AWS user group meetup - May 2014 at Cohesive
CloudCamp Chicago
 
Application Monitoring using Datadog
Application Monitoring using DatadogApplication Monitoring using Datadog
Application Monitoring using Datadog
Mukta Aphale
 
Datadog- Monitoring In Motion
Datadog- Monitoring In Motion Datadog- Monitoring In Motion
Datadog- Monitoring In Motion
Cloud Native Apps SF
 
How Netflix thinks of DevOps. Spoiler: we don’t.
How Netflix thinks of DevOps. Spoiler: we don’t.How Netflix thinks of DevOps. Spoiler: we don’t.
How Netflix thinks of DevOps. Spoiler: we don’t.
Dianne Marsh
 
DevOpsCon Cloud Workshop
DevOpsCon Cloud Workshop DevOpsCon Cloud Workshop
DevOpsCon Cloud Workshop
Sascha Möllering
 
HA SOA Application with GlusterFS
HA SOA Application with GlusterFSHA SOA Application with GlusterFS
HA SOA Application with GlusterFS
zeridon
 
Sas 2015 event_driven
Sas 2015 event_drivenSas 2015 event_driven
Sas 2015 event_driven
Sascha Möllering
 
Is serverless the new swiss cheese? ServerlessDays NYC 2018
Is serverless the new swiss cheese? ServerlessDays NYC 2018Is serverless the new swiss cheese? ServerlessDays NYC 2018
Is serverless the new swiss cheese? ServerlessDays NYC 2018
Chase Douglas
 
MongoDB.local Berlin: Atlas for your Enterprise
MongoDB.local Berlin: Atlas for your EnterpriseMongoDB.local Berlin: Atlas for your Enterprise
MongoDB.local Berlin: Atlas for your Enterprise
MongoDB
 
DevOps Fest 2020. Kohsuke Kawaguchi. GitOps, Jenkins X & the Future of CI/CD
DevOps Fest 2020. Kohsuke Kawaguchi. GitOps, Jenkins X & the Future of CI/CDDevOps Fest 2020. Kohsuke Kawaguchi. GitOps, Jenkins X & the Future of CI/CD
DevOps Fest 2020. Kohsuke Kawaguchi. GitOps, Jenkins X & the Future of CI/CD
DevOps_Fest
 
AWS re:Invent 2016: Open-Source Resources (DCS201)
AWS re:Invent 2016: Open-Source Resources (DCS201)AWS re:Invent 2016: Open-Source Resources (DCS201)
AWS re:Invent 2016: Open-Source Resources (DCS201)
Amazon Web Services
 
Global Azure Bootcamp 2016 - Azure Automation Invades Your Data Centre
Global Azure Bootcamp 2016 - Azure Automation Invades Your Data CentreGlobal Azure Bootcamp 2016 - Azure Automation Invades Your Data Centre
Global Azure Bootcamp 2016 - Azure Automation Invades Your Data Centre
kieranjacobsen
 
Akkurate Akka
Akkurate AkkaAkkurate Akka
Akkurate Akka
Yurii Ostapchuk
 
Five Years of EC2 Distilled
Five Years of EC2 DistilledFive Years of EC2 Distilled
Five Years of EC2 Distilled
Grig Gheorghiu
 
(ISM309) Efficient Innovation:High-Velocity Cost Management at Netflix
(ISM309) Efficient Innovation:High-Velocity Cost Management at Netflix(ISM309) Efficient Innovation:High-Velocity Cost Management at Netflix
(ISM309) Efficient Innovation:High-Velocity Cost Management at Netflix
Amazon Web Services
 

What's hot (20)

Taking Gliffy to the Cloud – Moving to Atlassian Connect - Mike Cialowicz
Taking Gliffy to the Cloud – Moving to Atlassian Connect - Mike CialowiczTaking Gliffy to the Cloud – Moving to Atlassian Connect - Mike Cialowicz
Taking Gliffy to the Cloud – Moving to Atlassian Connect - Mike Cialowicz
 
(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
 
Infrastructure Automation on AWS using a Real-World Customer Example
Infrastructure Automation on AWS using a Real-World Customer ExampleInfrastructure Automation on AWS using a Real-World Customer Example
Infrastructure Automation on AWS using a Real-World Customer Example
 
Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)
Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)
Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)
 
Lessons Learned Running The Largest OpenStack Clouds
Lessons Learned Running The Largest OpenStack CloudsLessons Learned Running The Largest OpenStack Clouds
Lessons Learned Running The Largest OpenStack Clouds
 
Chicago AWS user group meetup - May 2014 at Cohesive
Chicago AWS user group meetup - May 2014 at CohesiveChicago AWS user group meetup - May 2014 at Cohesive
Chicago AWS user group meetup - May 2014 at Cohesive
 
Application Monitoring using Datadog
Application Monitoring using DatadogApplication Monitoring using Datadog
Application Monitoring using Datadog
 
Datadog- Monitoring In Motion
Datadog- Monitoring In Motion Datadog- Monitoring In Motion
Datadog- Monitoring In Motion
 
How Netflix thinks of DevOps. Spoiler: we don’t.
How Netflix thinks of DevOps. Spoiler: we don’t.How Netflix thinks of DevOps. Spoiler: we don’t.
How Netflix thinks of DevOps. Spoiler: we don’t.
 
DevOpsCon Cloud Workshop
DevOpsCon Cloud Workshop DevOpsCon Cloud Workshop
DevOpsCon Cloud Workshop
 
HA SOA Application with GlusterFS
HA SOA Application with GlusterFSHA SOA Application with GlusterFS
HA SOA Application with GlusterFS
 
Sas 2015 event_driven
Sas 2015 event_drivenSas 2015 event_driven
Sas 2015 event_driven
 
Is serverless the new swiss cheese? ServerlessDays NYC 2018
Is serverless the new swiss cheese? ServerlessDays NYC 2018Is serverless the new swiss cheese? ServerlessDays NYC 2018
Is serverless the new swiss cheese? ServerlessDays NYC 2018
 
MongoDB.local Berlin: Atlas for your Enterprise
MongoDB.local Berlin: Atlas for your EnterpriseMongoDB.local Berlin: Atlas for your Enterprise
MongoDB.local Berlin: Atlas for your Enterprise
 
DevOps Fest 2020. Kohsuke Kawaguchi. GitOps, Jenkins X & the Future of CI/CD
DevOps Fest 2020. Kohsuke Kawaguchi. GitOps, Jenkins X & the Future of CI/CDDevOps Fest 2020. Kohsuke Kawaguchi. GitOps, Jenkins X & the Future of CI/CD
DevOps Fest 2020. Kohsuke Kawaguchi. GitOps, Jenkins X & the Future of CI/CD
 
AWS re:Invent 2016: Open-Source Resources (DCS201)
AWS re:Invent 2016: Open-Source Resources (DCS201)AWS re:Invent 2016: Open-Source Resources (DCS201)
AWS re:Invent 2016: Open-Source Resources (DCS201)
 
Global Azure Bootcamp 2016 - Azure Automation Invades Your Data Centre
Global Azure Bootcamp 2016 - Azure Automation Invades Your Data CentreGlobal Azure Bootcamp 2016 - Azure Automation Invades Your Data Centre
Global Azure Bootcamp 2016 - Azure Automation Invades Your Data Centre
 
Akkurate Akka
Akkurate AkkaAkkurate Akka
Akkurate Akka
 
Five Years of EC2 Distilled
Five Years of EC2 DistilledFive Years of EC2 Distilled
Five Years of EC2 Distilled
 
(ISM309) Efficient Innovation:High-Velocity Cost Management at Netflix
(ISM309) Efficient Innovation:High-Velocity Cost Management at Netflix(ISM309) Efficient Innovation:High-Velocity Cost Management at Netflix
(ISM309) Efficient Innovation:High-Velocity Cost Management at Netflix
 

Viewers also liked

AWS re:Invent 2016: Get the Most from AWS KMS: Architecting Applications for ...
AWS re:Invent 2016: Get the Most from AWS KMS: Architecting Applications for ...AWS re:Invent 2016: Get the Most from AWS KMS: Architecting Applications for ...
AWS re:Invent 2016: Get the Most from AWS KMS: Architecting Applications for ...
Amazon Web Services
 
Strategies to Optimize Costs Using AWS - AWS May 2016 Webinar Series
Strategies to Optimize Costs Using AWS - AWS May 2016 Webinar SeriesStrategies to Optimize Costs Using AWS - AWS May 2016 Webinar Series
Strategies to Optimize Costs Using AWS - AWS May 2016 Webinar Series
Amazon Web Services
 
Deep Dive on Serverless Web Applications - AWS May 2016 Webinar Series
Deep Dive on Serverless Web Applications - AWS May 2016 Webinar SeriesDeep Dive on Serverless Web Applications - AWS May 2016 Webinar Series
Deep Dive on Serverless Web Applications - AWS May 2016 Webinar Series
Amazon Web Services
 
AWS re:Invent 2016: Workshop: Adhere to the Principle of Least Privilege by U...
AWS re:Invent 2016: Workshop: Adhere to the Principle of Least Privilege by U...AWS re:Invent 2016: Workshop: Adhere to the Principle of Least Privilege by U...
AWS re:Invent 2016: Workshop: Adhere to the Principle of Least Privilege by U...
Amazon Web Services
 
AWS re:Invent 2016: Cloud agility and faster connectivity with AT&T NetBond a...
AWS re:Invent 2016: Cloud agility and faster connectivity with AT&T NetBond a...AWS re:Invent 2016: Cloud agility and faster connectivity with AT&T NetBond a...
AWS re:Invent 2016: Cloud agility and faster connectivity with AT&T NetBond a...
Amazon Web Services
 
Deep Dive on AWS reInvent 2016 Breakout Sessions
Deep Dive on AWS reInvent 2016 Breakout SessionsDeep Dive on AWS reInvent 2016 Breakout Sessions
Deep Dive on AWS reInvent 2016 Breakout Sessions
Amazon Web Services
 
Incident Coordination Workshop
Incident Coordination WorkshopIncident Coordination Workshop
Incident Coordination Workshop
Amazon Web Services
 
AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017
AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017
AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017
Amazon Web Services
 
AWS re:Invent 2016: AWS Database State of the Union (DAT320)
AWS re:Invent 2016: AWS Database State of the Union (DAT320)AWS re:Invent 2016: AWS Database State of the Union (DAT320)
AWS re:Invent 2016: AWS Database State of the Union (DAT320)
Amazon Web Services
 
(SEC301) Strategies for Protecting Data Using Encryption in AWS
(SEC301) Strategies for Protecting Data Using Encryption in AWS(SEC301) Strategies for Protecting Data Using Encryption in AWS
(SEC301) Strategies for Protecting Data Using Encryption in AWS
Amazon Web Services
 
AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)
AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)
AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)
Amazon Web Services
 

Viewers also liked (11)

AWS re:Invent 2016: Get the Most from AWS KMS: Architecting Applications for ...
AWS re:Invent 2016: Get the Most from AWS KMS: Architecting Applications for ...AWS re:Invent 2016: Get the Most from AWS KMS: Architecting Applications for ...
AWS re:Invent 2016: Get the Most from AWS KMS: Architecting Applications for ...
 
Strategies to Optimize Costs Using AWS - AWS May 2016 Webinar Series
Strategies to Optimize Costs Using AWS - AWS May 2016 Webinar SeriesStrategies to Optimize Costs Using AWS - AWS May 2016 Webinar Series
Strategies to Optimize Costs Using AWS - AWS May 2016 Webinar Series
 
Deep Dive on Serverless Web Applications - AWS May 2016 Webinar Series
Deep Dive on Serverless Web Applications - AWS May 2016 Webinar SeriesDeep Dive on Serverless Web Applications - AWS May 2016 Webinar Series
Deep Dive on Serverless Web Applications - AWS May 2016 Webinar Series
 
AWS re:Invent 2016: Workshop: Adhere to the Principle of Least Privilege by U...
AWS re:Invent 2016: Workshop: Adhere to the Principle of Least Privilege by U...AWS re:Invent 2016: Workshop: Adhere to the Principle of Least Privilege by U...
AWS re:Invent 2016: Workshop: Adhere to the Principle of Least Privilege by U...
 
AWS re:Invent 2016: Cloud agility and faster connectivity with AT&T NetBond a...
AWS re:Invent 2016: Cloud agility and faster connectivity with AT&T NetBond a...AWS re:Invent 2016: Cloud agility and faster connectivity with AT&T NetBond a...
AWS re:Invent 2016: Cloud agility and faster connectivity with AT&T NetBond a...
 
Deep Dive on AWS reInvent 2016 Breakout Sessions
Deep Dive on AWS reInvent 2016 Breakout SessionsDeep Dive on AWS reInvent 2016 Breakout Sessions
Deep Dive on AWS reInvent 2016 Breakout Sessions
 
Incident Coordination Workshop
Incident Coordination WorkshopIncident Coordination Workshop
Incident Coordination Workshop
 
AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017
AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017
AWS Foundational and Platform Services - Module 1 Parts 2 & 3 - AWSome Day 2017
 
AWS re:Invent 2016: AWS Database State of the Union (DAT320)
AWS re:Invent 2016: AWS Database State of the Union (DAT320)AWS re:Invent 2016: AWS Database State of the Union (DAT320)
AWS re:Invent 2016: AWS Database State of the Union (DAT320)
 
(SEC301) Strategies for Protecting Data Using Encryption in AWS
(SEC301) Strategies for Protecting Data Using Encryption in AWS(SEC301) Strategies for Protecting Data Using Encryption in AWS
(SEC301) Strategies for Protecting Data Using Encryption in AWS
 
AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)
AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)
AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)
 

Similar to What we learned from the AWS Outage

Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
Amazon Web Services
 
AWS Black Belt Tips
AWS Black Belt TipsAWS Black Belt Tips
AWS Black Belt Tips
Amazon Web Services
 
AWS Black Belt Tips
AWS Black Belt TipsAWS Black Belt Tips
AWS Black Belt Tips
Amazon Web Services
 
T1 – Architecting highly available applications on aws
T1 – Architecting highly available applications on awsT1 – Architecting highly available applications on aws
T1 – Architecting highly available applications on awsAmazon Web Services
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users
Amazon Web Services
 
How to Design for High Availability & Scale with AWS
How to Design for High Availability & Scale with AWSHow to Design for High Availability & Scale with AWS
How to Design for High Availability & Scale with AWS
Blazeclan Technologies Private Limited
 
(ARC301) Scaling Up to Your First 10 Million Users
(ARC301) Scaling Up to Your First 10 Million Users(ARC301) Scaling Up to Your First 10 Million Users
(ARC301) Scaling Up to Your First 10 Million Users
Amazon Web Services
 
Deep Dive: Scaling Up to Your First 10 Million Users
Deep Dive: Scaling Up to Your First 10 Million UsersDeep Dive: Scaling Up to Your First 10 Million Users
Deep Dive: Scaling Up to Your First 10 Million Users
Amazon Web Services
 
From AWS to Series A in 5 Easy Pieces
From AWS to Series A in 5 Easy PiecesFrom AWS to Series A in 5 Easy Pieces
From AWS to Series A in 5 Easy Pieces
Amazon Web Services
 
Aws webcast - Scaling on AWS 13 08-20
Aws webcast - Scaling on AWS 13 08-20Aws webcast - Scaling on AWS 13 08-20
Aws webcast - Scaling on AWS 13 08-20
Amazon Web Services
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersAmazon Web Services
 
What is Amazon Web Services & How to Start to deploy your apps ?
What is Amazon Web Services & How to Start to deploy your apps ?What is Amazon Web Services & How to Start to deploy your apps ?
What is Amazon Web Services & How to Start to deploy your apps ?
Sébastien ☁ Stormacq
 
AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
 AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
Amazon Web Services
 
Site reliability in the Serverless age - Serverless Boston 2019
Site reliability in the Serverless age  - Serverless Boston 2019Site reliability in the Serverless age  - Serverless Boston 2019
Site reliability in the Serverless age - Serverless Boston 2019
Erik Peterson
 
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)
Amazon Web Services
 
ENT309 scaling up to your first 10 million users
ENT309 scaling up to your first 10 million usersENT309 scaling up to your first 10 million users
ENT309 scaling up to your first 10 million users
Amazon Web Services
 
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million UsersAWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
Amazon Web Services
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users
Amazon Web Services
 
Aplicaciones a gran escala: Cómo servir a millones de usuarios
Aplicaciones a gran escala: Cómo servir a millones de usuariosAplicaciones a gran escala: Cómo servir a millones de usuarios
Aplicaciones a gran escala: Cómo servir a millones de usuarios
Amazon Web Services
 
AWS Public Sector Symposium 2014 Canberra | Black Belt Tips on AWS
AWS Public Sector Symposium 2014 Canberra | Black Belt Tips on AWS AWS Public Sector Symposium 2014 Canberra | Black Belt Tips on AWS
AWS Public Sector Symposium 2014 Canberra | Black Belt Tips on AWS
Amazon Web Services
 

Similar to What we learned from the AWS Outage (20)

Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
 
AWS Black Belt Tips
AWS Black Belt TipsAWS Black Belt Tips
AWS Black Belt Tips
 
AWS Black Belt Tips
AWS Black Belt TipsAWS Black Belt Tips
AWS Black Belt Tips
 
T1 – Architecting highly available applications on aws
T1 – Architecting highly available applications on awsT1 – Architecting highly available applications on aws
T1 – Architecting highly available applications on aws
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users
 
How to Design for High Availability & Scale with AWS
How to Design for High Availability & Scale with AWSHow to Design for High Availability & Scale with AWS
How to Design for High Availability & Scale with AWS
 
(ARC301) Scaling Up to Your First 10 Million Users
(ARC301) Scaling Up to Your First 10 Million Users(ARC301) Scaling Up to Your First 10 Million Users
(ARC301) Scaling Up to Your First 10 Million Users
 
Deep Dive: Scaling Up to Your First 10 Million Users
Deep Dive: Scaling Up to Your First 10 Million UsersDeep Dive: Scaling Up to Your First 10 Million Users
Deep Dive: Scaling Up to Your First 10 Million Users
 
From AWS to Series A in 5 Easy Pieces
From AWS to Series A in 5 Easy PiecesFrom AWS to Series A in 5 Easy Pieces
From AWS to Series A in 5 Easy Pieces
 
Aws webcast - Scaling on AWS 13 08-20
Aws webcast - Scaling on AWS 13 08-20Aws webcast - Scaling on AWS 13 08-20
Aws webcast - Scaling on AWS 13 08-20
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users
 
What is Amazon Web Services & How to Start to deploy your apps ?
What is Amazon Web Services & How to Start to deploy your apps ?What is Amazon Web Services & How to Start to deploy your apps ?
What is Amazon Web Services & How to Start to deploy your apps ?
 
AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
 AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
 
Site reliability in the Serverless age - Serverless Boston 2019
Site reliability in the Serverless age  - Serverless Boston 2019Site reliability in the Serverless age  - Serverless Boston 2019
Site reliability in the Serverless age - Serverless Boston 2019
 
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)
 
ENT309 scaling up to your first 10 million users
ENT309 scaling up to your first 10 million usersENT309 scaling up to your first 10 million users
ENT309 scaling up to your first 10 million users
 
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million UsersAWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users
 
Aplicaciones a gran escala: Cómo servir a millones de usuarios
Aplicaciones a gran escala: Cómo servir a millones de usuariosAplicaciones a gran escala: Cómo servir a millones de usuarios
Aplicaciones a gran escala: Cómo servir a millones de usuarios
 
AWS Public Sector Symposium 2014 Canberra | Black Belt Tips on AWS
AWS Public Sector Symposium 2014 Canberra | Black Belt Tips on AWS AWS Public Sector Symposium 2014 Canberra | Black Belt Tips on AWS
AWS Public Sector Symposium 2014 Canberra | Black Belt Tips on AWS
 

More from PolarSeven Pty Ltd

AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series
PolarSeven Pty Ltd
 
Aws user group #04 landing zones
Aws user group #04   landing zonesAws user group #04   landing zones
Aws user group #04 landing zones
PolarSeven Pty Ltd
 
Aws user group #03 - All things Iot
Aws user group #03 - All things IotAws user group #03 - All things Iot
Aws user group #03 - All things Iot
PolarSeven Pty Ltd
 
Aws user group #01 lets talk serverless
Aws user group #01   lets talk serverlessAws user group #01   lets talk serverless
Aws user group #01 lets talk serverless
PolarSeven Pty Ltd
 
AWS Reinvent Recap 2018
AWS Reinvent Recap 2018 AWS Reinvent Recap 2018
AWS Reinvent Recap 2018
PolarSeven Pty Ltd
 
AWS User Group October
AWS User Group OctoberAWS User Group October
AWS User Group October
PolarSeven Pty Ltd
 
AWS User Group August
AWS User Group AugustAWS User Group August
AWS User Group August
PolarSeven Pty Ltd
 
AWS User Group November
AWS User Group NovemberAWS User Group November
AWS User Group November
PolarSeven Pty Ltd
 
AWS User Group September
AWS User Group September AWS User Group September
AWS User Group September
PolarSeven Pty Ltd
 
Amazon Web Services User Group Sydney - March 2018
Amazon Web Services User Group Sydney - March 2018Amazon Web Services User Group Sydney - March 2018
Amazon Web Services User Group Sydney - March 2018
PolarSeven Pty Ltd
 
Amazon Web Services User Group Sydney - February 2018
Amazon Web Services User Group Sydney - February 2018Amazon Web Services User Group Sydney - February 2018
Amazon Web Services User Group Sydney - February 2018
PolarSeven Pty Ltd
 
Deep Dive on Cloud Policies and Automation
Deep Dive on Cloud Policies and AutomationDeep Dive on Cloud Policies and Automation
Deep Dive on Cloud Policies and Automation
PolarSeven Pty Ltd
 
Securing Traffic Leaving A VPC
Securing Traffic Leaving A VPCSecuring Traffic Leaving A VPC
Securing Traffic Leaving A VPC
PolarSeven Pty Ltd
 
Telstra Programmable Networks & Scaling a Serverless Team with Automation
 Telstra Programmable Networks & Scaling a Serverless Team with Automation Telstra Programmable Networks & Scaling a Serverless Team with Automation
Telstra Programmable Networks & Scaling a Serverless Team with Automation
PolarSeven Pty Ltd
 
AWS User Group Sydney - Meetup #60
AWS User Group Sydney - Meetup #60AWS User Group Sydney - Meetup #60
AWS User Group Sydney - Meetup #60
PolarSeven Pty Ltd
 
Shared Security in AWS
Shared Security in AWSShared Security in AWS
Shared Security in AWS
PolarSeven Pty Ltd
 
Visibility, Optimization & Governance for Cloud Services
Visibility, Optimization & Governance for Cloud ServicesVisibility, Optimization & Governance for Cloud Services
Visibility, Optimization & Governance for Cloud Services
PolarSeven Pty Ltd
 
AWS OpsWorks for Chef Automate
AWS OpsWorks for Chef AutomateAWS OpsWorks for Chef Automate
AWS OpsWorks for Chef Automate
PolarSeven Pty Ltd
 
AWS CloudFormation Automation, TrafficScript, and Serverless architecture wit...
AWS CloudFormation Automation, TrafficScript, and Serverless architecture wit...AWS CloudFormation Automation, TrafficScript, and Serverless architecture wit...
AWS CloudFormation Automation, TrafficScript, and Serverless architecture wit...
PolarSeven Pty Ltd
 
AWS User Group December 2016
AWS User Group December 2016AWS User Group December 2016
AWS User Group December 2016
PolarSeven Pty Ltd
 

More from PolarSeven Pty Ltd (20)

AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series
 
Aws user group #04 landing zones
Aws user group #04   landing zonesAws user group #04   landing zones
Aws user group #04 landing zones
 
Aws user group #03 - All things Iot
Aws user group #03 - All things IotAws user group #03 - All things Iot
Aws user group #03 - All things Iot
 
Aws user group #01 lets talk serverless
Aws user group #01   lets talk serverlessAws user group #01   lets talk serverless
Aws user group #01 lets talk serverless
 
AWS Reinvent Recap 2018
AWS Reinvent Recap 2018 AWS Reinvent Recap 2018
AWS Reinvent Recap 2018
 
AWS User Group October
AWS User Group OctoberAWS User Group October
AWS User Group October
 
AWS User Group August
AWS User Group AugustAWS User Group August
AWS User Group August
 
AWS User Group November
AWS User Group NovemberAWS User Group November
AWS User Group November
 
AWS User Group September
AWS User Group September AWS User Group September
AWS User Group September
 
Amazon Web Services User Group Sydney - March 2018
Amazon Web Services User Group Sydney - March 2018Amazon Web Services User Group Sydney - March 2018
Amazon Web Services User Group Sydney - March 2018
 
Amazon Web Services User Group Sydney - February 2018
Amazon Web Services User Group Sydney - February 2018Amazon Web Services User Group Sydney - February 2018
Amazon Web Services User Group Sydney - February 2018
 
Deep Dive on Cloud Policies and Automation
Deep Dive on Cloud Policies and AutomationDeep Dive on Cloud Policies and Automation
Deep Dive on Cloud Policies and Automation
 
Securing Traffic Leaving A VPC
Securing Traffic Leaving A VPCSecuring Traffic Leaving A VPC
Securing Traffic Leaving A VPC
 
Telstra Programmable Networks & Scaling a Serverless Team with Automation
 Telstra Programmable Networks & Scaling a Serverless Team with Automation Telstra Programmable Networks & Scaling a Serverless Team with Automation
Telstra Programmable Networks & Scaling a Serverless Team with Automation
 
AWS User Group Sydney - Meetup #60
AWS User Group Sydney - Meetup #60AWS User Group Sydney - Meetup #60
AWS User Group Sydney - Meetup #60
 
Shared Security in AWS
Shared Security in AWSShared Security in AWS
Shared Security in AWS
 
Visibility, Optimization & Governance for Cloud Services
Visibility, Optimization & Governance for Cloud ServicesVisibility, Optimization & Governance for Cloud Services
Visibility, Optimization & Governance for Cloud Services
 
AWS OpsWorks for Chef Automate
AWS OpsWorks for Chef AutomateAWS OpsWorks for Chef Automate
AWS OpsWorks for Chef Automate
 
AWS CloudFormation Automation, TrafficScript, and Serverless architecture wit...
AWS CloudFormation Automation, TrafficScript, and Serverless architecture wit...AWS CloudFormation Automation, TrafficScript, and Serverless architecture wit...
AWS CloudFormation Automation, TrafficScript, and Serverless architecture wit...
 
AWS User Group December 2016
AWS User Group December 2016AWS User Group December 2016
AWS User Group December 2016
 

Recently uploaded

This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
nirahealhty
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
laozhuseo02
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
laozhuseo02
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Sanjeev Rampal
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
natyesu
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
3ipehhoa
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
Javier Lasa
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Brad Spiegel Macon GA
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
Arif0071
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
JungkooksNonexistent
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
JeyaPerumal1
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
3ipehhoa
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
Gal Baras
 
Comptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guideComptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guide
GTProductions1
 
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptxInternet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
VivekSinghShekhawat2
 
guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
Rogerio Filho
 
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
3ipehhoa
 
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
ufdana
 
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
keoku
 

Recently uploaded (20)

This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
 
Comptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guideComptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guide
 
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptxInternet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
 
guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
 
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
 
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
 
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
 

What we learned from the AWS Outage

  • 1. AWS Outage / Availability zone failure in Sydney region - 05th June 2016 - Author: Gilles Baillet * Disclaimer: The opinions expressed in this presentation are the author's own and do not reflect the view of his employer
  • 2. Who am I? Gilles Baillet Cloud Centre of Excellence Manager Leading a team of 5 DevOps engineers on the Ops (dark) side of DevOps AWS Certified SysOps Associate and Solutions Architect Associate Fan of DevOps, AWS, Data Pipeline and now Lambda Food, drinks travel-addict and almost married! You can meet me at several meetups around Sydney (AWS, Docker, Elastic) You can connect with me on LinkedIn: https://au.linkedin.com/in/gillesbaillet I accept connections from (almost) everyone!
  • 3. Before we start Availability Zone Alignment Randomisation of the assignment of AZs across AWS accounts Our AZ are “aligned” across all our production and non-production accounts Tip: Talk to your TAM!
  • 4. The chain of events as presented by AWS • At 3:25PM AEST: loss of power at a regional substation • At 4:46PM AEST: power restored • At 6:00PM AEST: over 80% of impacted services back online • At 1:00AM AEST: nearly all instances recovered • TOTAL DURATION: 1h21 / 9h35 http://aws.amazon.com/message/4372T8/
  • 5. The chain of events as experienced by my company • At 3:25PM AEST: trigger of monitoring/alerting services • At 3:30PM AEST: conference bridge opened • At 5:30PM AEST: most services were restored • At 3:00AM AEST: all production services were restored • TOTAL DURATION: 2h05 / 11h35
  • 6. Black Swan “An event that comes as a surprise, has a major effect, and is often inappropriately rationalized after the fact with the benefit of hindsight. The term is based on an ancient saying which presumed black swans did not exist, but the saying was rewritten after black swans were discovered in the wild” https://en.wikipedia.org/wiki/Black_swan_theory Taleb, N. N. (2007). The black swan: The impact of the highly improbable. Random house.
  • 7. Impact during the outage • all services running in the impacted AZ • some Auto Scaling Group processes • a NIC failure at 3:26PM Instance restarted No ELB health checks Healthy instance marked as unhealthy • EC2 Console / EC2 CLI commands • Some CloudWatch metrics • Some services relying on a single instance of a service (eg. domain controller)
  • 8. Impact after the outage • DB repair / integrity check • Restoration of data stored on ephemeral storage • 24 hours fixing instances in lower environments (DEV, UAT etc.) • Clean up of rogue instances
  • 9. Some things did work • ELB Health checks • RDS Database failover • Some Auto Scaling Group processes • AWS support escalation • All critical services running on Cloud 2.0!
  • 10. Lesson learned • Implementation vs design • Instance type matters • AWS Enterprise support is worth the cost • Cattle are awesome • Datacenters in Sydney are not weather proof • 100s of companies impacted
  • 11. What’s next? • Review of design documents vs implementation • Use older instance types • Use Chaos Monkey • Turn Pets into Cattle (more work for my team!) • Deploy new VPCs across 3 AZs • Revisit DNS client TTL versus Health Check timeout • AWS to fix “things” on their end

Editor's Notes

  1. Impactful event but fantastic opportunity to prove or disprove some design decisions and assess the impact of such events on our services to make them more resilient and keep our customers happy.
  2. Cloud 2.0 Cattle French (for those…) So good food and good drinks are always on the table Last but not least, soon a married man Poll Who is familiar with the difference between Pets and Cattle? Who is running pets? Who is running cattle? Who has been impacted?
  3. Access to both primary and secondary power lost as a result of a failure to transfer the load to generators.
  4. by forcing passive services in ap-southeast-2b to become active by temporarily removing some of our dependencies by activating/ implementing kill switches
  5. A complete datacentre failure = event that most people think can’t happen 1700 – All swans are white Until the discovery of Black Swans in WA
  6. Linux – DNS timeout of 5 sec /etc/resolv.conf was showing the unavailable DNS server first in the list ELB health check timeout = 5 sec Failure of API making us blind on what was happening on the infrastructure DNS was failing as some services failed as a result of not being able to reach their RDS database That correlates with AWS description of the behaviour of their infrastructure: When the APIs initially recovered, our systems were delayed in propagating some state changes and making them available via describe API calls. This meant that some customers could not see their newly launched resources, and some existing instances appeared as stuck in pending or shutting down when customers tried to make changes to their infrastructure in the affected Availability Zone. These state delays also increased latency of adding new instances to existing Elastic Load Balancing (ELB) load balancers.
  7. That correlates with AWS description of the behaviour of their infrastructure: When the APIs initially recovered, our systems were delayed in propagating some state changes and making them available via describe API calls. This meant that some customers could not see their newly launched resources, and some existing instances appeared as stuck in pending or shutting down when customers tried to make changes to their infrastructure in the affected Availability Zone. These state delays also increased latency of adding new instances to existing Elastic Load Balancing (ELB) load balancers.
  8. Re-designing the VPC structure across 3 availability zones in a challenge in itself as current subnets use the whole IP range for the VPC