Amazon Cloud Major Outages Analysis
 


Amazon is a highly successful organization. Its team innovates persistently and launches successful products to the marketplace. Over time Amazon has reinvented itself: it started as an online bookseller and then added new products to its online store, such as toys, clothes, shoes, and furniture. Amazon is now considered the biggest online store on the planet and has done a lot of innovation in the retail space. The Amazon team honed its infrastructure skills over time to achieve the scale it needs. In the last few years Amazon decided to leverage its IT infrastructure scale advantage and offer IT infrastructure as a service to its customers, for example storage services on a cloud-based platform as a service. Amazon is now one of the major cloud service providers. Amazon is considered to have a cool work culture; many journals refer to Amazon’s “Just-Do-It” culture. Many small and large organizations use Amazon’s cloud products and services. Organizations such as Dropbox, Reddit, Pinterest, AirBnB, and Netflix run their businesses on Amazon cloud products. The cloud platform is mission critical to Amazon’s customers.
In the recent past we have seen major outages on Amazon’s cloud platform: on Dec/24/2012, in Oct/2012, and in Jun/2012. Amazon cloud outages now seem to be almost a quarterly event! These outages cause major business disruptions for Amazon’s customers. For example, on Christmas Eve many customers were unable to use Netflix streaming services, and the Oct/2012 outage affected organizations such as Pinterest and AirBnB. It is puzzling that an organization so successful at providing great products and services in the retail space is failing (or struggling) in the cloud space. What are the potential reasons for the outages, and how can they be mitigated? I analyzed the major Amazon cloud incidents, considering the root cause analyses published by Amazon, public opinion, and customer commentary.

As per my analysis, process flaws in cloud operations are a constant theme in the majority of cloud outages at Amazon. Software issues (probably SDLC related) are also contributing factors. I look forward to hearing your thoughts.

    Presentation Transcript

    • Amazon Major Cloud Outage Analysis
      Author: Rahul Tyagi
    • 2 The Agenda
      • The Issue
      • The Goals
      • Analysis Methodology
      • The Analysis
    • 3 The Issue
      • Because the Amazon cloud is deeply embedded in enterprises, major Amazon cloud outages have widespread impact…
      • Organizations such as Netflix, Dropbox, AirBnB, and Pinterest were affected by Amazon cloud outages
    • 4 The Issue
      • Major cloud outages have been fairly regular events in the recent past. Some of the major outages:
        • Dec/24/2012
        • Oct/22/2012
        • Jun/29/2012
        • Apr/21/2011
    • 5 The Goals
      • We want to analyze the chain of events that caused each major Amazon cloud outage (based on official Amazon statements)…
      • We analyzed the major outages of the past 2 years…
      • The goal is to identify probable root causes and areas with opportunity to improve…
    • 6 Analysis Methodology
      We use the “Analytical Hierarchy Process” to identify root causes…
    • 7 Analysis Methodology
      • Analyze Amazon’s statements about the outages
      • Identify the “Chain of Events” causing each outage
      • Categorize the “Chain of Events”
      • Analysis and conclusion
    • 8 The Analysis > Analyze Amazon’s Statements about Outages
      We analyzed the following official Amazon statements…
      Outage Date: Amazon’s Statement
      • Dec/24/2012: http://aws.amazon.com/message/680587/
      • Oct/22/2012: http://aws.amazon.com/message/680342/
      • Jun/29/2012: http://aws.amazon.com/message/67457/
      • Apr/21/2011: http://aws.amazon.com/message/65648/
    • 9 The Analysis > Identify “Chain of Events” causing outages
      Outage: Core Issue
      • Dec-12: “The [ELB state] data was deleted by a maintenance process that was inadvertently run against the production ELB state data”
      • Oct-12: “The root cause of the problem was a latent bug in an operational data collection agent that runs on the EBS storage servers”
      • Jun-12: “In the single datacenter that did not successfully transfer to the generator backup, all servers continued to operate normally on Uninterruptable Power Supply (“UPS”) power. As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT”
      • Apr-11: “The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”
      The statements in double quotes are from Amazon’s press releases…
    • 10 The Analysis > Identify “Chain of Events” causing outages
      Outage: Chain of Events
      • Dec-12
        • "Maintenance process inadvertently run against production ELB state data"
        • Process for incident approval had loose ends
        • Validation of the output of the maintenance process (which ran inadvertently) was missing
        • "load balancers that were modified were improperly configured by the control plane"
      • Oct-12
        • "latent bug in an (EBS) operational data collection agent"
        • "latent memory leak bug in the reporting agent"; monitoring for the memory leak was nonexistent
        • "the DNS update did not successfully propagate to all of the internal DNS servers"
        • "the (aggressive) throttling policy that was put in place was too aggressive"
      • Jun-12
        • "datacenter that did not successfully transfer to the generator backup"
        • "As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT"
        • "a small number of Multi-AZ RDS instances did not complete failover, due to a software bug"
        • "As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before"
      • Apr-11
        • “The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”
        • "We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures"
        • "We will audit our change process and increase the automation to prevent this mistake from happening in the future"
        • "We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster"
    • 11 The Analysis > Categorize “Chain of Events”
      Categories: Hardware, Software, Automation, Process (one X per category assigned to an event)
      • Dec-12
        • "Maintenance process inadvertently run against production ELB state data" [X X]
        • Process for incident approval had loose ends [X]
        • Validation for maintenance process's (which ran inadvertently) output was missing [X X X]
        • "load balancers that were modified were improperly configured by the control plane" [X]
      • Oct-12
        • "latent bug in an (EBS) operational data collection agent" [X X]
        • "latent memory leak bug in the reporting agent" The monitoring process of memory leak was non existent. [X X]
        • "the DNS update did not successfully propagate to all of the internal DNS servers" [X X]
        • "the (aggressive) throttling policy that was put in place was too aggressive" [X X]
      • Jun-12
        • "datacenter that did not successfully transfer to the generator backup" [X]
        • "As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT" [X]
        • "a small number of Multi-AZ RDS instances did not complete failover, due to a software bug" [X X]
        • "As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before" [X X]
      • Apr-11
        • “The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.” [X]
        • "We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures" [X]
        • "We will audit our change process and increase the automation to prevent this mistake from happening in the future" [X]
        • "We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster" [X X]
    • 12 The Analysis > Analysis and Conclusions
      Process issues are a common theme in major outages at Amazon cloud…
    • 13 The Analysis > Analysis and Conclusions
      Amazon Cloud Major Outage - Issue Categories (# of issues)
      • Process: 14
      • Software: 8
      • Automation: 4
      Process and Software are the leading contributing factors to major outages at Amazon…
    • 14 The Analysis > Analysis and Conclusions
      • The majority of issues contributing to outages are related to process or software
      • It seems that “Process” rigor in cloud operations and SDLC at Amazon has room to improve
      • Culture? We have heard that Amazon has a Just-Do-It culture; process rigor may require more than just “just-do-it”
    • 15 Thank You! You are Awesome! You deserve applause!!
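
The category totals reported on slide 13 (Process 14, Software 8, Automation 4) can be turned into shares of all categorized issues to check the claim that process and software dominate. The following is a minimal sketch under that assumption only; the names `issue_counts` and `category_shares` are illustrative and do not come from the presentation, and Hardware is left out because it does not appear in the slide 13 totals.

```python
# Category totals as reported on slide 13 of the presentation.
# Hardware is not listed there, so it is omitted (assumption: zero marks).
issue_counts = {"Process": 14, "Software": 8, "Automation": 4}


def category_shares(counts):
    """Return each category's share of all issue marks, in percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = category_shares(issue_counts)
total_marks = sum(issue_counts.values())

# Print categories from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {issue_counts[name]} of {total_marks} ({pct}%)")
```

Under these totals, Process accounts for a little over half of the 26 category marks, and Process and Software together for about 85%, which is consistent with the conclusion on slide 14.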