Cloud Computing Outages - Analysis of Key Outages 2009 - 2012


Cloud Computing Outages


  1. Cloud Computing Outages - Analysis of Key Outages 2009 - 2012
     Rajesh Prabhakar Analyst Bio @
  2. Global Cloud Computing - Forecast
     Gartner Cloud Forecast (US$BN): 2010: 74.3 | 2011: 89.4 | 2012: 107.2 | 2013: 128.9 | 2014: 152.1 | 2015: 176.8
     Forrester Cloud Forecast (US$BN): 2011: 40.7 | 2015: 97 | 2020: 241
     IDC Cloud Forecast (US$BN): 2009: 17.4 | 2010: 21.5 | 2013: 44.5 | 2015: 72.9
     Cloud Computing - Benefits
     • Cost effective and pay per use, easy accessibility anywhere and anytime, few IT resources and less IT infrastructure, scalability, and ease of use.
     Cloud Computing - Concerns
     • Data security and confidentiality, control of data, data access and Internet connectivity issues, outages, and third-party dependence and contract issues.
     Business Adoption & Outlook
     • According to Gartner's "Executive Program Agenda Survey", cloud computing has emerged as the top technological focus for CIOs.
     • CIOs expect that adoption of cloud technologies will free up to 50 percent of infrastructure and operational resources, which can be used for other strategic priorities.
     • BFSI and manufacturing are the early adopters of cloud services. Communications, high-tech industries and the public sector are interested in the potential of cloud services.
     • North American and European markets are the largest, and other geographies around the world will experience growth.
  3. Cloud Computing Outages 2012
     Vendor | When | Duration | What Happened & Why
     Microsoft (Xbox Live & Azure) | December 28 to December 31, 2012 | More than 36 hours | Xbox 360 users were affected after Microsoft's Cloud Save feature broke down on 28 December. The outage continued for the whole weekend, with users unable to access saved games held in the cloud until 31 December. Azure service was also disrupted between 28 and 30 December, and Microsoft initially reported that only users of its storage service in the South Central US region were affected. However, it quickly became apparent that the outage was also affecting its global Management Portal. The problem, which was blamed on "faulty nodes", took over 36 hours to resolve in full, with Microsoft issuing an apology "for the interruption and issues it has caused our customers".
     Amazon Web Services | December 24, 2012 | 24 hours | The outage affected Netflix customers across the United States, Canada and Latin America. It began at 3:30 p.m. Eastern time on Christmas Eve and lasted for some users into Christmas Day. The cause of the failure was a shutdown of several Elastic Load Balancers (ELB) that distribute network traffic to Netflix customers to support online streaming.
     Google Gmail Outage | December 10, 2012 | 18 minutes | Gmail was down for 18 minutes after a "routine update" briefly broke the e-mail service. Google reported that it conducted an update of its load-balancing software from 8:45 a.m. to 9:13 a.m. U.S. West Coast time, and that after the problems were detected it quickly rolled back the buggy code.
     Microsoft Office 365 | November 8 & 13, 2012 | Nov 8: 8 hours; Nov 13: 5 hours | An antivirus issue caused the November 8 email issues, according to a Microsoft blog post. The November 13 outage was due to a combination of maintenance, "network element" and load issues. The post also details steps Microsoft officials said they are taking to prevent these kinds of problems in the future.
     Dropbox | October 26, 2012 | Several hours | The interruptions led many to link the issues to an undetermined slowdown of Internet availability, although an exact cause has never been determined.
     Google App Engine | October 26, 2012 | 4 hours | At approximately 7:30 a.m. Pacific time, Google began experiencing slow performance and dropped connections from one of the components of App Engine. Service users saw slow responses and an inability to connect to services. A global restart plus additional load unexpectedly reduced the count of healthy traffic routers below the minimum required for reliable operation. This overloaded the remaining traffic routers, and the overload spread to all App Engine datacenters.
     Amazon Web Services | October 22, 2012 | 6 hours | Amazon stated that a hardware replacement at a Virginia data center, which had been the site of other recent outages, triggered a data collection bug that downed service for "a fraction" of its cloud customers, such as Pinterest, Reddit, FastCompany and Flipboard. The service disruptions, which also hit customers such as Netflix, were first reported at approximately 1:00 p.m. EST on Monday, Oct. 22 and piled up over the next three hours until troubleshooting relieved the issues by 7:15 p.m. EST, according to the postmortem.
  4. Cloud Computing Outages 2012
     Vendor | When | Duration | What Happened & Why
     Apple iCloud | September 11, 2012 | 30 hours | iCloud's email service was spotty or unavailable for what Apple described on its iCloud status page as 1.1 percent of its users. The outage, unexplained so far, prevented access via email client software and the iCloud web site.
     Go Daddy | September 10, 2012 | 6 hours | Customers experienced intermittent service outages starting shortly after 10 a.m. PDT. Service was fully restored by 4 p.m. PDT. The outage was due to a series of internal network events that corrupted router data tables.
     Microsoft Azure | July 26, 2012 | 2 hours | Microsoft's Windows cloud service, Azure, suffered an outage on 26 July because of a "safety valve" configuration error. The outage affected parts of Western Europe for two hours.
     Twitter | July 26, 2012 | 2 hours | A data center glitch brought down Twitter for roughly 2 hours on Thursday, as the micro-blogging service suffered its second widespread outage in 5 weeks and another blow to its reputation for reliability.
     Google Talk | July 26, 2012 | 5 hours | The Google Talk IM and video chat service was down in parts of the United States and across the globe, the third major cloud outage of the day. The outage began at 6:50 a.m.; at 11:25 a.m., Google said it had fixed the problems and the service had been fully restored.
     [...] | July 10, 2012 | 7-8 hours | The vendor suffered a service outage early in the day as its data center provider attempted to replace electrical equipment. Servers lost power, and its technology team was not able to troubleshoot problems until power was restored. The root cause was a power failure in a West Coast data center. Service was unavailable or degraded from 1:24 a.m. to 8:30 a.m. as the team brought various systems back online.
     Amazon Web Services | June 29, 2012 | 2-3 hours | Part of the Amazon Web Services EC2 cloud went down during a power outage that affected its Ashburn, Va., data center on June 29. The outage impacted sites such as Instagram, Pinterest and Netflix.
     Twitter | June 21, 2012 | 2 hours | Service crashed around 9 a.m. Pacific time, came back online briefly around 10:10 a.m., failed again half an hour later, and finally recovered by 11:08 a.m. The outage affected all platforms and took down both third-party and Twitter apps on Android and iOS. A cascading bug in one of Twitter's infrastructure components was the cause, and Twitter was forced to roll back its software to the previous stable version to restore services.
     Apple iCloud | June 20, 2012 | 4 hours | Some users of Apple's iCloud and iMessage services reported outages and were unable to send or receive messages on iOS or use the various iCloud features. Even Apple employees at its Cupertino headquarters complained about internet connection issues. Apple did not provide any explanation for the outages.
  5. Cloud Computing Outages 2012
     Vendor | When | Duration | What Happened & Why
     Amazon Web Services | June 14, 2012 | 2 hours | At approximately 8:44 p.m. PDT, there was a cable fault in the high-voltage utility power distribution system. Two utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power. At 8:53 p.m. PDT, one of the generators overheated and powered off because of a defective cooling fan, leaving the affected volumes without primary, backup or secondary backup power. The generator fan was fixed and services restarted by 10:40 p.m. PDT. Dropbox, Quora, Heroku, Parse and Pinterest were among the companies affected by this outage.
     Google Gmail Outage | June 7, 2012 | 90 minutes | The Gmail web service was unavailable for more than 90 minutes, and the outage may have affected less than 1.38% of the Gmail user base. Google acknowledged the problem around 11 a.m. Eastern time and declared it resolved at 12:40 p.m.
     Facebook Outage | May 31, 2012 | 2 hours | Facebook's website suffered sporadic outages lasting anywhere from half an hour to two hours, according to various blogs, tweets and affected users. The company said the problem has been fixed.
     Google Gmail Outage | April 17, 2012 | 1 hour | Google announced the problem at about 12:40 p.m. Eastern time on the Google Apps Status Dashboard, and the issue was resolved at around 1:45 p.m. The bug affected less than 10% of Gmail's user base, although Google initially announced that less than 2% of Gmail users were affected. The root cause was a misconfiguration during a routine capacity upgrade, which prevented changes to existing customer data for upgraded users.
     Amazon EC2 | March 15, 2012 | 22 minutes | Between 2:22 a.m. and 2:43 a.m. PDT, internet connectivity was impaired in the US-EAST-1 region (Northern Virginia data center) because a networking router bug caused a defective route to the internet. Full connectivity has been restored.
     Microsoft Windows Azure Outage | February 29, 2012 | ½ day | Partial service outage due to a leap year bug. On Feb 29, 2012, the system generated a transfer certificate, which virtual machines (VMs) use to communicate between the application VM and the host operating system, with a one-year expiration date of Feb 29, 2013. That date does not exist, because 2013 is not a leap year. The Windows Azure dashboard also failed due to traffic overload as customers crowded the site to learn about the outage. Microsoft is providing a 33% credit to all Windows Azure customers, regardless of whether they were affected by the outage.
     Zoho SaaS suite offline | January 2012 | 8:25 a.m. to 6:12 p.m. Pacific | A power outage at an Equinix data center in California took Zoho's SaaS suite offline. Service went back online at 12:10 p.m. Mail followed at 1:35 p.m. Zoho CRM came back at 3:15, with only 60% of customers having access to their databases.
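The leap-year failure in the Windows Azure entry above is easy to reproduce. The following Python sketch is not Microsoft's code; the function names and the choice to fall back to February 28 are assumptions made purely for illustration. It shows how bumping only the year of February 29, 2012 produces a date that does not exist, and one possible guard against it.

```python
from datetime import date


def naive_one_year_later(issued: date) -> date:
    """Bump only the year, as the buggy logic is described as doing."""
    # date.replace raises ValueError when the result is not a real date,
    # for example Feb 29, 2012 -> Feb 29, 2013.
    return issued.replace(year=issued.year + 1)


def safe_one_year_later(issued: date) -> date:
    """Hypothetical fix: use Feb 28 when the target year has no Feb 29."""
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        return issued.replace(year=issued.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_one_year_later(leap_day)
    except ValueError as err:
        print("naive expiry failed:", err)
    print("safe expiry:", safe_one_year_later(leap_day))
```

Whether an expiry should fall back to February 28 or roll forward to March 1 is a policy decision; the point of the sketch is only that date arithmetic on calendar fields needs an explicit rule for leap days.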
  6. Cloud Computing Outages 2011
     Vendor | When | Duration | What Happened & Why
     Apple iPhone 4S Siri | November 2011 | 1 day | Siri loses even the most basic functionality when Apple's servers are down. Because Siri depends on servers to do the heavy computing required for voice recognition, the service is useless without that connection. Network outages caused the disruption, according to Apple.
     Blackberry outage | October 2011 | 3 days | The outage was caused by a hardware failure (a core switch failure) that prompted a "ripple effect" in RIM's systems. Users in Europe, the Middle East, Africa, India, Brazil, China and Argentina initially experienced email and message delays and complete outages, and the outages later spread to North America. The main problem was message backlogs: the downtime produced a huge queue of undelivered messages, causing delays and traffic jams.
     Google Docs | September 2011 | 1 hour | The Google Docs collaboration application seized up, shutting out millions of users from their document lists, documents, drawings and Apps Scripts. The outage was caused by a memory management bug that software engineers triggered in a change designed to "improve real time collaboration within the document list."
     Windows Live services - Hotmail & SkyDrive | September 2011 | 3 hours | Users did not lose any data during the outage. The interruption was due to an issue in the Domain Name Service (DNS): a network traffic balancing tool received an update that did not work properly.
     Amazon's EC2 cloud & | August 2011 | 1-2 days | A transformer exploded and caught fire near the datacenter, and the resulting power outage was compounded by generator failure. Power backup systems at both data centers failed, causing power outages. The transformer explosion was attributed to a lightning strike, but this was disputed by the local utility provider.
  7. Cloud Computing Outages 2011
     Vendor | When | Duration | What Happened & Why
     Microsoft's BPOS | August 2011 | 1-2 days | A transformer exploded and caught fire near the datacenter, and the resulting power outage was compounded by generator failure. Power backup systems at both data centers failed, causing power outages. The transformer explosion was attributed to a lightning strike, but this was disputed by the local utility provider.
     Amazon Web Services | April 2011 | 4 days | During an upgrade, the traffic shift was executed incorrectly: rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower-capacity redundant EBS network. As a result, Amazon Elastic Block Store ("EBS") volumes in a single Availability Zone within the US East Region became unable to service read and write operations. It also impacted the Relational Database Service ("RDS"). RDS depends upon EBS for database and log storage, so a portion of the RDS databases hosted in the primary affected Availability Zone became inaccessible.
     Microsoft BPOS Outages | May 2011 | 2 hours | Paying customers' email was delayed by as much as nine hours. The delay arose as outgoing messages started getting stuck in the pipeline.
     Twitter Outages | March & Feb 2011 | 1-4 hours | Outages were due to over-capacity and to moving operations to a new data center.
     Intuit Quick Books Online | March 2011 | 2 days | Service failures were blamed on human error during scheduled maintenance operations. Intuit changed its network configuration and inadvertently blocked customer access to a portion of the company's servers. A surge in traffic overloaded the servers when connectivity was restored, so the company opted to restore service.
     Google Mail and Apps Outage | February 2011 | 2 days | Google Mail and Google Apps users experienced login errors and empty mailboxes. Google Engineering determined that the root cause was a bug inadvertently introduced in a Gmail storage software update. The bug caused the affected users' messages and account settings to become temporarily unavailable from the datacenters.
  8. Cloud Computing Outages 2010
     Vendor | When | Duration | What Happened & Why
     Hotmail Outage | December 2010 | 3 days | A number of users reported that their email messages and folders were missing from their Hotmail accounts. The error came from a script that was meant to delete dummy accounts created for automated testing; it mistakenly targeted 17,000 real accounts instead.
     Skype Outage | December 2010 | 1 day | A cluster of support servers responsible for offline instant messaging became overloaded, and the P2P network became unstable and suffered a critical failure. A supernode is important to the P2P network: it acts like a directory, supports other Skype clients, and helps to establish connections between them. The failure of 25-30% of supernodes in the P2P network resulted in an increased load on the remaining supernodes.
     Paypal Outage | November 2010 | 3 hours | A network hardware failure was the trigger for the outage. The hardware failure was worsened by problems in shifting traffic to another data center, resulting in about 90 minutes of downtime.
     Facebook Outage | September 2010 | 2½ hours | The outage was due to an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed. Every single client saw the invalid value and attempted to fix it, which led to a query to a database cluster, and the cluster was overloaded with thousands of queries per second. Even after the problem was fixed, the stream of queries continued.
     Microsoft BPOS Outages | August & September 2010 | 2 hours | A design issue in an upgrade had an unexpected impact, resulting in a 2-hour period of intermittent access for BPOS organizations served from North America.
     Wikipedia Outage | July & March 2010 | 2-3 hours | In July, a power failure is understood to have affected Wikimedia's pmtpa cluster. Due to the temporary unavailability of several critical systems and the large impact on available system capacity, all Wikimedia projects went down. In March, Wikimedia servers overheated in the organization's European data center and shut themselves off automatically. Wikimedia then switched all its traffic to its server cluster in Florida, but the failover process, which involves changing servers' DNS entries, malfunctioned, knocking the organization's sites offline around the [...]
     [...] Outage | June 2010 | 2 hours | Failure of a Cisco switch at the Newark, N.J., data center caused intermittent network connectivity. The dedicated switch failed, the second failover switch crashed as well, and the problem was caused by a software [...]
     [...] outage | June 2010 | 5 hours | Increased activity on the site, combined with system enhancements and upgrades, uncovered networking issues. There were incidences of poor site performance and a high number of errors due to one of the internal sub-networks being over capacity.
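The Skype entry above notes that losing 25-30% of supernodes increased the load on those that remained. As rough arithmetic, and assuming (this is not stated in the source) that load was spread evenly and total demand stayed constant, each surviving node carries 1 / (1 - f) times its previous load when a fraction f of nodes fails. The helper name below is hypothetical.

```python
def load_multiplier(failed_fraction: float) -> float:
    """Return how many times more load each surviving node carries.

    Assumes evenly balanced load and unchanged total demand.
    """
    if not 0.0 <= failed_fraction < 1.0:
        raise ValueError("failed_fraction must be in [0, 1)")
    return 1.0 / (1.0 - failed_fraction)


if __name__ == "__main__":
    for fraction in (0.25, 0.30):
        multiplier = load_multiplier(fraction)
        print(f"{fraction:.0%} failed -> {multiplier:.2f}x load per surviving node")
```

Under these assumptions, a 25% loss means one third more load per surviving node, and a 30% loss about 43% more. If survivors were already near capacity, that extra load could push more of them over, which is one way such failures cascade.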
  9. Cloud Computing Outages 2009
     Vendor | When | Duration | What Happened & Why
     [...] Outage | January 2010, 2009 | 1-2 hours | Outages were caused by server disruption when a core network device failed, stopping all data from being processed in Japan, Europe, and North America. The technical reason for the outage: a core network device had failed due to memory allocation errors. The backup plan, which was supposed to trigger a cut-over to a redundant system, also failed.
     Amazon's EC2 | June 2009 | 4-5 hours | A lightning storm caused damage to a single Power Distribution Unit (PDU) in a single Availability Zone.
     eBay Paypal | August 2009 | 1-4 hours | The online payments system failed a couple of times, which led to non-completion of transactions. A network hardware issue was blamed for the outage.
     Twitter | August 2009 | ½ day | A denial-of-service attack was blamed for the problem.
     Google Gmail | September 2009 | 2 hours, 2 times | Reasons given ranged from routing errors to server maintenance issues.
     Microsoft Sidekick | October 2009 | 6 days | Microsoft's Danger server farm, which holds T-Mobile Sidekick subscribers' cloud data, crashed, depriving users of their calendar, address book, and other key data. Critical data was lost during [...]
     [...]m Outage | June 2009 | 1 day | A power outage and subsequent power generator failures caused servers to fail. The company was forced to pay out between $2.5 million and $3.5 million in service credits to customers.
     [...] | December 2009 | 1 hour | The issues resulted from a problem with a router used for peering and backbone connectivity, located outside the data center at a peering facility, which handles approximately 20% of Rackspace's Dallas traffic. The router configuration error was part of final testing for data center integration between the Chicago and Dallas facilities.
  10. Cloud Computing Outages - Analysis
     1. First and foremost, there is no escape from outages; they are bound to happen. Dominant players like Amazon, Salesforce, Microsoft and Google have also seen significant outages in the last three years.
     2. Outages have affected only small parts of the cloud and did not bring down the whole cloud. Problems were confined to single data centers and affected a small share of total users.
     3. Data loss was small, and data was successfully recovered. In some cases there was a time lag in recovering the data, but it was restored.
     4. Most outages occurred during software updates in maintenance and updating cycles. Script and update errors at the datacenters caused most of the outages, which affected a small part of cloud users.
     5. Hardware issues were also to blame: network failures, router and switch problems, and network overloads.
     6. Power failures and lack of proper infrastructure are other reasons for outages. Data storage and database issues, in terms of proper data backups, are other areas of concern.
     7. Communication was another issue. Not all vendors communicated with clients during outages, and this has led to some negativity towards cloud computing. In some cases vendors did not report properly on the steps taken to avoid future outages.
     8. Traffic failed to route to other datacenters when one data center went down. When most of the network in a data center was overloaded, cloud providers had no option to divert or route the traffic to another point where it could be controlled.
     9. PR issues: the outages were publicized on social media networks like Twitter and on blogs, with massive discussion, and users tracked the outages on an hourly basis, which put more pressure on service providers and cloud users.
     10. All the major vendors have learnt significant lessons from these outages and are focusing on improving their service and investing in networks, datacenters and other infrastructure. Cloud computing has seen good growth in the last three years, and the outages are helping vendors improve further.
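Point 8 describes providers having no way to steer traffic away from a failed or overloaded data center. A minimal sketch of one such routing decision follows. The `DataCenter` record, its fields, and `pick_datacenter` are hypothetical names chosen for illustration, not any provider's API, and the capacity figures are made up.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DataCenter:
    name: str
    healthy: bool   # result of the most recent health check
    load: int       # current requests per second
    capacity: int   # maximum requests per second it can absorb


def pick_datacenter(candidates: Sequence[DataCenter], extra_load: int) -> Optional[DataCenter]:
    """Return the first data center, in preference order, that is healthy
    and can absorb extra_load without exceeding its capacity.

    Returns None when no candidate qualifies.
    """
    for dc in candidates:
        if dc.healthy and dc.load + extra_load <= dc.capacity:
            return dc
    return None


if __name__ == "__main__":
    fleet = [
        DataCenter("primary", healthy=False, load=0, capacity=100),
        DataCenter("secondary", healthy=True, load=90, capacity=100),
        DataCenter("tertiary", healthy=True, load=10, capacity=100),
    ]
    target = pick_datacenter(fleet, extra_load=20)
    print("route traffic to:", target.name if target else "nowhere (shed load)")
```

Returning `None` rather than the least-bad option makes "no safe target" explicit, so the caller can shed or queue load instead of pushing it onto a path that cannot carry it, which is the failure pattern in the April 2011 Amazon entry, where traffic was shifted onto a lower-capacity network.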
  11. Cloud Computing Outages - Strategies to be adopted: For Customers or Clients
     1. Clients should understand that cloud computing can also fail and result in a major disaster for the business. The notion that the cloud is absolutely safe is wrong.
     2. Clients should study the cloud service provider's offering thoroughly in terms of SLAs, disaster recovery plans, backups and storage, response times during outages, the communication process to be followed during outages, financial liability, loss security, etc.
     3. Cloud computing is mostly used by social networking companies and by email and productivity suite providers, whose services are used by most people. Communication with those users is critical, as loyalties will shift fast if users are not given proper information. Communication may save or destroy the business.
     4. Clients should also have the necessary backup, storage and other facilities, and disaster recovery plans, on their own premises so as to overcome worst-case scenarios. Data loss has severe implications in terms of both financial loss and reputation loss.
     5. Choose a service provider that has the capability to service increasing volumes in future and to invest, develop and grow in the cloud. The provider should not only secure the data and reduce cost but also have the expertise to foresee the problems of cloud growth and overcome them.
     6. Clients should look at the cloud as a tool to effectively achieve organizational goals and should work closely with cloud providers for mutual benefit.
     7. Clients have to document and mitigate all the risks associated with cloud computing, and management has to employ skilled people to manage cloud activities and coordinate with cloud service providers during an outage.
     8. Outages have to be studied properly, and the necessary steps and mitigation strategies have to be formulated to avoid repeated occurrences in future.
     9. Contracts and financial penalties for outages have to be clearly defined, without ambiguity.
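When comparing SLAs as point 2 suggests, it can help to translate outage hours into an availability percentage. The sketch below assumes a 365-day measurement period, which is an assumption for illustration; real SLAs may define the period differently (many are measured monthly). The function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; assumed measurement period


def availability_percent(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Convert total downtime in a period into an availability percentage."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100.0 * (1.0 - downtime_hours / period_hours)


if __name__ == "__main__":
    # A single 36-hour outage, the duration listed for the December 2012 Azure entry.
    print(f"{availability_percent(36):.2f}% availability over one year")
```

On this yearly basis, one 36-hour outage alone works out to roughly 99.59% availability, already below a "three nines" (99.9%) target, which allows under nine hours of downtime per year.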
  12. Cloud Computing Outages - Strategies to be adopted: For Cloud Service Providers
     1. Service offerings have to be clearly defined. All SLAs, disaster recovery mechanisms, data storage and recovery, response times, communication processes, financial penalties, etc. have to be clearly stated in the contracts.
     2. Infrastructure has to be developed, datacenters have to be installed at various geographical locations, and systems have to be secured against power failures and natural disasters.
     3. Employ skilled experts for developing, maintaining and updating cloud software, and install the necessary hardware. Outages were caused by improper updating cycles, such as script errors and software bugs.
     4. Technology has to be developed, as recent outages highlight that cloud computing issues recur on a regular basis and the necessary technology advancements have not been adopted. Lack of proper coordination among service providers, and their failure to coordinate efforts in defining industry standards, is another issue.
     5. Most of the outages have been due to a lack of proper understanding of, or foresight into, cloud demand and issues. All service providers must have all risks documented and frameworks in place to tackle them.
     6. Failure to communicate properly during cloud outages is to be avoided by having an agreed communication policy in place. The necessary reports have to be provided to all stakeholders once the outage is successfully tackled.
     7. Social media is a major user of cloud computing and also plays a major role in creating havoc during outages, as worried users crowd online and critically discuss the outage. Service providers therefore have to be very careful during this time.
     8. Service providers have to work closely not only with clients but also with networking equipment providers and data centers, and innovate and develop the future technologies needed for effective cloud computing. Investment in R&D is the critical differentiator and success factor for existing service providers and new entrants.
     9. Service providers have to help clients understand how cloud computing can be a necessary tool in achieving organizational goals, and constantly work with organizations to improve services and offer new ones.