AMAZON FAIL
      DC Public Library’s Lessons Learned from the Amazon Cloud Outage




Friday, June 24, 2011
BACKGROUND

    •   DClibrary.org was the first major DC Government website to use
        cloud-based hosting, beginning circa June 2009

    •   Initial architecture was designed to leverage the low cost of large-instance
        Amazon Web Services (AWS) servers for database operations
        and lower-cost small and mid-size servers for WWW services

    •   DClibrary.org Content Management System is Drupal 6

    •   Bonus: Experimental Drupal 7 Amazon Machine Image (AMI) available
        on our website; currently undergoing user testing

WHAT WENT WRONG

    •   Background: AWS decouples the physical hard disk space (called Elastic
        Block Store, or EBS) from the CPUs (called “compute instances”)

    •   Late April 2011: an AWS engineer mistakenly routed “backplane” traffic (the
        internal server traffic that connects EBS to the CPUs) through a system that
        could not handle the load

    •   This triggered an alarm; since everything in AWS is redundant, the systems
        thought the backup EBS drives had all failed simultaneously, causing an
        overload as the system tried to compensate

    •   In a nutshell, it’s almost as if the CPUs no longer had hard drives

2009 ARCHITECTURE

    •   June 2009 architecture focused
        on load balancing and database
        replication across Amazon
        Availability Zones

    •   SVN machine was also in the cloud

    •   Too reliant on one service
        provider (Amazon)


PRE-OUTAGE ARCHITECTURE


    •   AWS began a new service called “RDS” (Relational Database Service) in 2010.
        This was a managed MySQL database service that was more powerful
        and simpler to administer than running the database ourselves on large servers

    •   We migrated to RDS in 2010

    •   The remaining architecture, with the mid-instance front ends and load
        balancers, remained the same




KEY LESSONS LEARNED
    •   Amazon’s failover across multiple availability zones is not reliable
         •    Multiple zones do not imply separate physical or logical facilities!
         •    Amazon’s poor communication during the outage compounded this problem
         •    Because of Amazon’s poor initial incident response communications, we decided on the spot to
              create new machine instances (from AMIs) in a different geographic region (US-West instead of
              US-East) and copy over the “offsite” one-day-old SVN and DB backups
         •    Downtime was minimized to 1.5 hours; many websites (Reddit, Quora, Foursquare) were down for
              days
    •   Future worst case: Amazon goes completely offline. This means we need a very recent full backup of
        both WWW and DB instances in a physically and logically separate facility, plus the ability to
        load balance or change DNS quickly
         •    Solution was to scale up Rackspace instances and make daily copies to those servers
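
A minimal sketch of how the "very recent full backup" requirement could be checked automatically. The slides do not describe any tooling, so the function name `stale_backups`, the 26-hour threshold, and the sample timestamps are all hypothetical, chosen only to match the daily copy schedule mentioned above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: copies are made daily, so anything older than
# about a day (plus some slack) is treated as stale.
MAX_BACKUP_AGE = timedelta(hours=26)


def stale_backups(last_backup_times, now, max_age=MAX_BACKUP_AGE):
    """Return the names of backups older than max_age, sorted by name."""
    return sorted(
        name
        for name, taken_at in last_backup_times.items()
        if now - taken_at > max_age
    )


# Illustrative data only: one fresh WWW copy and one overdue DB copy.
now = datetime(2011, 6, 24, 12, 0, tzinfo=timezone.utc)
backups = {
    "www": now - timedelta(hours=10),
    "db": now - timedelta(hours=30),
}
print(stale_backups(backups, now))
```

A check like this would normally feed the alerting service rather than print to a console; the point is only that the age of the offsite copy is something to measure, not assume.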

2011 ARCHITECTURE




WHAT WE RECOMMEND
    •   Get physically and logically separate backup servers
    •   Do nightly full-copy backups to the above servers
    •   Have a clear, written process in place for the following things:
         •    communicating with superiors about what’s happening
         •    what steps need to be taken to fail over
         •    when the “worst-case” failover plan is implemented (can be time-based, circumstance-based,
              or both)
    •   Either implement automatic load balancing or (not as good) have complete control over your DNS
    •   Use a very good alerts monitoring service; some of the best ones are cheap or free. We use
        binarycanary.com.
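
The written failover trigger recommended above (time-based, circumstance-based, or both) can be sketched as a small decision function. This is an illustration under stated assumptions, not the library's actual process: the `OutageStatus` fields, the `should_fail_over` name, and the 30-minute threshold are hypothetical, since the slides give no specific numbers or criteria.

```python
from dataclasses import dataclass


@dataclass
class OutageStatus:
    minutes_down: float            # how long the primary site has failed checks
    provider_reports_outage: bool  # the provider has confirmed an incident
    backup_site_healthy: bool      # the separate-facility copy answers checks


# Hypothetical threshold; pick a value that fits your own written process.
MAX_MINUTES_DOWN = 30


def should_fail_over(status, max_minutes_down=MAX_MINUTES_DOWN):
    """Apply a time-based rule, a circumstance-based rule, or both."""
    if not status.backup_site_healthy:
        return False  # nothing safe to fail over to
    time_based = status.minutes_down >= max_minutes_down
    circumstance_based = (
        status.provider_reports_outage and status.minutes_down > 0
    )
    return time_based or circumstance_based


# Illustrative call: 45 minutes down, no provider confirmation, backup is up.
print(should_fail_over(OutageStatus(45, False, True)))
```

Writing the rule down as explicit conditions makes it easier to agree on in advance, so the decision during an outage is a lookup rather than a debate.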




 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 

Dcpl cloud computing amazon fail

  • 1. AMAZON FAIL DC Public Library’s Lessons Learned from the Amazon Cloud Outage Friday, June 24, 2011
  • 2. BACKGROUND
      • DClibrary.org was the first major DC Government website to use cloud-based hosting, beginning circa June 2009
      • Initial architecture was designed to use low-cost large Amazon Web Services (AWS) instances for database operations and lower-cost small and mid-size instances for WWW services
      • DClibrary.org Content Management System is Drupal 6
      • Bonus: Experimental Drupal 7 Amazon machine instance available on our website; currently undergoing user testing
  • 3. WHAT WENT WRONG
      • Background: AWS decouples the physical hard disk space (called Elastic Block Store or EBS) from the CPUs (called “compute instances”)
      • Late April 2011: an AWS engineer mistakenly routed the “backplane” (internal server traffic), which connects EBS to the CPUs, through a system that could not handle the load
      • This triggered an alarm; since everything in AWS is redundant, the systems thought the backup EBS drives had all failed simultaneously, causing an overload as the system tried to compensate
      • In a nutshell, it’s almost as if the CPUs no longer had hard drives
  • 4. 2009 ARCHITECTURE
      • June 2009 architecture focused on load balancing and database replication across Amazon Availability Zones
      • SVN machine was also in the cloud
      • Too reliant on one service provider (Amazon)
  • 5. PRE-OUTAGE ARCHITECTURE
      • AWS began a new service called “RDS” (Relational Database Service) in 2010. This was a managed MySQL database service that was more powerful and simpler to administer than running the database ourselves on large servers
      • We migrated to RDS in 2010
      • The rest of the architecture, with the mid-instance front ends and load balancers, stayed the same
  • 6. KEY LESSONS LEARNED
      • Amazon’s failover across multiple availability zones is not reliable
          • Multiple availability zones do not imply separate physical or logical facilities!
      • Amazon’s poor communication during the outage compounded this problem
      • Because of Amazon’s poor initial incident response communications, we decided on the spot to create new machine instances (AMIs) in a different geographic zone (US-West vs. US-East) and copy over the “offsite” one-day-old SVN and DB backups
      • Downtime was minimized to 1.5 hours; many websites (Reddit, Quora, Foursquare) were down for days
      • Future worst case: Amazon goes completely offline. This means we need a very recent full backup of both WWW and DB instances in a physically and logically separate facility, plus the ability to load balance or change DNS quickly
      • Solution was to scale up Rackspace instances and make daily copies to those servers
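The "very recent full backup" requirement can be checked automatically. The following is only a minimal sketch, not the library's actual tooling: the function name `stale_backups`, the 26-hour threshold, and the backup names are all hypothetical, chosen to match a daily copy schedule with some slack.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: copies are made daily, so allow a little slack
# beyond 24 hours before calling a backup stale.
MAX_BACKUP_AGE = timedelta(hours=26)


def stale_backups(last_backup_times, now, max_age=MAX_BACKUP_AGE):
    """Return the names of backups older than max_age, sorted alphabetically.

    last_backup_times maps a backup name (e.g. "www", "db") to the datetime
    at which its last full copy finished on the separate facility.
    """
    return sorted(
        name
        for name, finished_at in last_backup_times.items()
        if now - finished_at > max_age
    )


if __name__ == "__main__":
    now = datetime(2011, 6, 24, 12, 0)
    backups = {
        "www": now - timedelta(hours=3),  # copied overnight: fresh
        "db": now - timedelta(days=2),    # missed a nightly run: stale
    }
    print(stale_backups(backups, now))
```

A check like this could feed the alerts monitoring service, so that a missed nightly copy is noticed before an outage rather than during one.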
  • 8. WHAT WE RECOMMEND
      • get physically and logically separate backup servers
      • do nightly full copy backups to the above servers
      • have a clear, written process in place for the following things:
          • communicating with superiors about what’s happening
          • what steps need to be taken to fail over
          • when the “worst-case” failover plan is implemented (can be time-based or circumstance-based or both)
      • either implement automatic load balancing or (not as good) have complete control over your DNS
      • use a very good alerts monitoring service; some of the best ones are cheap/free. We use binarycanary.com.
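A written failover rule that is "time-based or circumstance-based or both" can be stated precisely enough to remove guesswork during an outage. The sketch below is one hypothetical way to express such a rule; the 30-minute limit, the `OutageStatus` fields, and the function name are assumptions for illustration and do not come from the slides.

```python
from dataclasses import dataclass

# Hypothetical limit: how long the site may be down before the
# worst-case plan is triggered on time alone.
MAX_MINUTES_DOWN = 30


@dataclass
class OutageStatus:
    minutes_down: float                   # time since monitoring first alerted
    primary_db_reachable: bool            # can the front ends reach the database?
    provider_reports_region_outage: bool  # has the provider confirmed a regional problem?


def should_fail_over(status, max_minutes_down=MAX_MINUTES_DOWN):
    """Decide whether to start the worst-case failover plan.

    Time-based trigger: the outage has lasted at least max_minutes_down.
    Circumstance-based trigger: the database is unreachable and the provider
    has confirmed a region-wide outage. Either trigger is sufficient.
    """
    time_trigger = status.minutes_down >= max_minutes_down
    circumstance_trigger = (
        not status.primary_db_reachable
        and status.provider_reports_region_outage
    )
    return time_trigger or circumstance_trigger


if __name__ == "__main__":
    early_but_confirmed = OutageStatus(5, False, True)
    print(should_fail_over(early_but_confirmed))
```

Writing the rule down in this form makes the decision reviewable in advance; the actual failover steps (starting the standby servers, switching DNS or the load balancer) stay in the written process.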