Amazon is a highly successful organization. Its teams innovate persistently and launch successful products to the marketplace. Over time Amazon has reinvented itself: it started its journey as an online bookseller, then added new products to its online store, e.g. toys, clothes, shoes, and furniture. Today Amazon is considered the biggest online store on the planet, and it has done a lot of innovation in the retail space. The Amazon team honed its infrastructure skills over time to achieve the scale it needs. In the last few years Amazon decided to leverage its IT infrastructure scale advantage and offer IT infrastructure as a service to its customers, e.g. storage services on a cloud-based platform. Amazon is now one of the major cloud service providers. Amazon is also considered to have a cool work culture; many journals refer to Amazon’s “Just-Do-It” style culture. Many small and large organizations use Amazon’s cloud-based products and services. Organizations like Dropbox, Reddit, Pinterest, AirBnB, and Netflix rely on Amazon cloud products to run their businesses. The cloud platform is mission critical to Amazon’s customers.
In the recent past we have seen major outages on Amazon’s cloud platform: on December 24, 2012, in October 2012, and in June 2012. It seems Amazon cloud outages have become pretty much a quarterly event! These snafus cause major business disruptions for Amazon’s customers. For example, on Christmas Eve many customers were unable to enjoy Netflix streaming services, and the October 2012 outage impacted organizations like Pinterest and AirBnB. It makes one wonder why an organization that is extremely successful at providing great products and services in the retail space is failing (or struggling) in the cloud space. What are the potential reasons for these outages, and how can they be mitigated? I analyzed the major Amazon cloud incidents, considering the root cause analyses published by Amazon, public opinion, and customer commentary.
Based on my analysis, I see process flaws (in cloud operations) as a constant theme in the majority of cloud outages at Amazon. Software-related issues (probably in the SDLC) also appear as contributing factors. I look forward to hearing your thoughts.