Lessons from a Cloud Malfunction
An Analysis of a Global System Failure

Alex Maclinovsky – Architecture Innovation
12.09.2011

Copyright © 2010 Accenture All Rights Reserved. Accenture, its logo, and High Performance Delivered are trademarks of Accenture.
Introduction
For most of this presentation we will discuss a global outage of Skype that took place around December 22nd, 2010. But this story is not about Skype. It is about building dependable internet-scale systems; it is based on the Skype case because complex distributed systems share many similarities, and certain failure mechanisms manifest themselves again and again.

Disclaimer: this analysis is not based on insider knowledge. I relied solely on the CIO's blog, my observations during the event, and experience in running similar systems.
Approximate Skype Outage Timeline
Weeks before incident: A buggy version of the Windows client is released; the bug is identified and fixed, and a new version is released
0 min: A cluster of support servers responsible for offline instant messaging becomes overloaded and fails
+30 min: Buggy clients receive delayed messages and begin to crash - 20% of the total
+1 hour: 30% of the publicly available super-nodes are down; traffic surges
+2 hours: Orphaned clients crowd the surviving super-nodes, which in turn self-destruct
+3 hours: The cloud disintegrates
+6 hours: Skype introduces recovery super-nodes; recovery starts
+12 hours: Recovery is slow; resources are cannibalized to introduce more nodes
+24 hours: Cloud recovery complete
+48 hours: Cloud stable; resources released
+72 hours: Sacrificed capabilities restored
Eleven Lessons
1.      Pervasive Monitoring
2.      Early warning systems
3.      Graceful degradation
4.      Contagious failure awareness
5.      Design for failures
6.      Fail-fast and exponential back-off
7.      Scalable and fault-tolerant control plane
8.      Fault injection testing
9.      Meaningful failure messages
10. Efficient, timely and honest communication to the end users
11. Separate command, control and recovery infrastructure

1 - Pervasive Monitoring
It looks like Skype either did not have sufficient monitoring in place or their monitoring was not actionable (no alarms were triggered when the system started behaving abnormally).
• Instrument everything
• Collect, aggregate and store telemetry
• Make results available in (near) real time
• Constantly monitor against normal operational boundaries
• Detect trends
• Generate events, raise alerts, trigger recovery actions
• Go beyond averages (tp90, tp99, tp999)

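To make the monitoring bullets above concrete, here is a minimal sketch (the class name and thresholds are hypothetical, not Skype's actual tooling) that aggregates latency telemetry, computes tail percentiles such as tp90/tp99/tp999, and raises an alert when a percentile drifts outside its normal operational boundary:

```python
import bisect

class LatencyMonitor:
    """Aggregates latency telemetry and alerts when tail percentiles
    drift outside their normal operational boundaries."""

    def __init__(self, boundaries_ms):
        # boundaries_ms: percentile -> maximum acceptable latency,
        # e.g. {90: 200, 99: 500, 99.9: 1500} (illustrative limits)
        self.boundaries_ms = boundaries_ms
        self.samples = []          # sorted window of recent latencies

    def record(self, latency_ms):
        bisect.insort(self.samples, latency_ms)

    def percentile(self, p):
        # nearest-rank percentile over the current window
        idx = max(0, int(round(p / 100.0 * len(self.samples))) - 1)
        return self.samples[idx]

    def check(self, alert):
        for p, limit in self.boundaries_ms.items():
            value = self.percentile(p)
            if value > limit:
                alert(f"tp{p} latency {value} ms exceeds boundary {limit} ms")

# Usage sketch: feed in telemetry, then evaluate against the boundaries
monitor = LatencyMonitor({90: 200, 99: 500, 99.9: 1500})
for latency in (120, 130, 90, 2100, 150):      # fake samples
    monitor.record(latency)
monitor.check(alert=print)                      # print instead of paging
```

Here the tp90 check still passes while tp99 and tp99.9 catch the 2100 ms outlier, which is exactly why the bullets say to go beyond averages.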
2 - Early warning systems
Build mechanisms that predict that a problem is approaching, via trend analysis and cultivation:
• Monitor trends
• Use “canaries”
• Use PID controllers
• Look for unusual deviations between correlated values
Skype did not detect that the support cluster was approaching the tip-over point.

2 - Early warning: trend spotting
Look for unexpected changes in system state and behavior over time.

[Chart: common problem-predictor patterns - a step function in a value, an unexpected trend, worst-case degradation (tp99 pulling away from the average), and a deviation between correlated values (e.g. CPU vs. TPS).]
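One of the patterns in the chart above, a deviation between correlated values, is easy to check mechanically. The sketch below (the function name and thresholds are illustrative assumptions) flags points where CPU per transaction suddenly breaks away from its recent baseline, an early warning that something other than user traffic is consuming the system:

```python
def correlation_break(cpu, tps, window=10, ratio_jump=1.5):
    """Flag points where CPU per transaction suddenly deviates from its
    recent baseline - the 'deviation between correlated values' pattern."""
    alerts = []
    for i in range(window, len(cpu)):
        baseline = sum(cpu[j] / max(tps[j], 1) for j in range(i - window, i)) / window
        current = cpu[i] / max(tps[i], 1)
        if current > ratio_jump * baseline:
            alerts.append(i)
    return alerts

# CPU climbs while the transaction rate stays flat - an early warning
cpu = [40] * 15 + [75, 82, 90]
tps = [1000] * 18
print(correlation_break(cpu, tps))   # -> [15, 16, 17]
```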
3 - Graceful degradation
Graceful degradation occurs when, in response to non-normative operating conditions (e.g. overload, resource exhaustion, failure of components or downstream dependencies), the system continues to operate but provides a reduced level of service rather than failing completely. It should be viewed as a mechanism complementary to fault tolerance in the design of highly available distributed systems.

Overload protection, such as load shedding or throttling, would have caused this event to fizzle out as a minor QoS violation that most Skype users would never have noticed.
3 - Graceful degradation - continued
There were two places where overload protection was missing:
 • Simple, traditional throttling in the support cluster would have kept it from failing and triggering the rest of the event.
 • Once that cluster failed, a more sophisticated, globally distributed throttling mechanism could have prevented the contagious failure of the “supernodes”, which was the main reason for the global outage.
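A sketch of the first kind of protection, simple throttling in the support cluster (names and limits below are illustrative assumptions, not Skype's actual design): requests beyond a sustainable rate are shed, so offline-message delivery degrades instead of the cluster collapsing.

```python
import time

class TokenBucketThrottle:
    """Load-shedding throttle: requests beyond the sustainable rate are
    rejected (reduced service) instead of overloading the cluster."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # shed: the message is delivered later, not never

# Usage sketch: a burst far above the sustainable rate is mostly deferred
throttle = TokenBucketThrottle(rate_per_sec=500, burst=1000)
accepted = sum(throttle.allow() for _ in range(5000))
print(f"accepted {accepted} of 5000 burst requests; the rest are deferred")
```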
4 - Contagious failure awareness
• The P2P cloud outage was a classic example of a contagious failure scenario induced by positive feedback
• It occurs when the failure of a component in a redundant system increases the probability of failure of its peers, which are supposed to take over and compensate for the initial failure
• This mechanism is quite common and is responsible for a number of infamous events, ranging from the Tu-144 crashes to the credit default swap debacle
• Grid architectures are susceptible to contagious failure: e.g. the 2009 Gmail outage

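A toy model (the numbers are purely illustrative, not Skype's real capacity figures) shows why the positive feedback is so destructive: once load from failed super-nodes is spread over the survivors, each survivor can be pushed past its own capacity, and the cascade feeds itself.

```python
def cascade(node_capacity, load_per_node, nodes, initial_failures):
    """Toy model of positive-feedback contagion: load from failed nodes is
    redistributed over the survivors, which may overload them in turn."""
    alive = nodes - initial_failures
    total_load = load_per_node * nodes       # demand does not go away
    while alive > 0 and total_load / alive > node_capacity:
        alive -= 1                           # another overloaded peer drops out
    return alive

# 100 super-nodes each running at 72% of capacity: losing 30 of them is
# enough to overload the survivors, and the whole overlay collapses.
print(cascade(node_capacity=1.0, load_per_node=0.72, nodes=100, initial_failures=30))  # -> 0
```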
5 - Design for failures
• Since failures are inevitable, dependable systems need to follow the principles of Recovery-Oriented Computing (ROC), aiming at recovery from failures rather than failure avoidance.
• Use built-in auto-recovery: restart → reboot → reimage → replace
• The root issue was specific to a client version, and the design should have assumed that this could happen:
  – check whether a newer version exists and upgrade
  – downgrade to select earlier versions flagged as “safe”.
5 - Design for failures: universal
auto-recovery strategy
[Diagram: universal auto-recovery strategy.]
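The original slide carried a diagram; a minimal sketch of the escalation idea it describes (the health check and recovery actions are hypothetical stand-ins) is to try the cheapest recovery action first and escalate only while the health check keeps failing:

```python
def auto_recover(check_healthy, actions):
    """Escalating auto-recovery: try the cheapest action first and escalate
    (restart -> reboot -> reimage -> replace) until the health check passes."""
    for name, action in actions:
        action()
        if check_healthy():
            return name              # recovered at this level
    return "escalate-to-human"       # every automated level failed

# Usage sketch with stand-in actions (real ones would call fleet tooling)
log = []
actions = [
    ("restart", lambda: log.append("restart process")),
    ("reboot",  lambda: log.append("reboot host")),
    ("reimage", lambda: log.append("reimage host")),
    ("replace", lambda: log.append("replace hardware")),
]
print(auto_recover(check_healthy=lambda: len(log) >= 2, actions=actions))   # -> "reboot"
```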
6 - Fail-fast and exponential back-off
• Vitally important in highly distributed systems to avoid self-inflicted distributed denial of service (DDoS) attacks similar to the one which decimated the supernodes
• Since there were humans in the chain, the back-off state should have been sticky: persisted somewhere to prevent circumvention by constantly restarting the client
• When building a system where the same request might be retried by different agents, it is important to implement persistent global back-off to make sure that no operation is retried more frequently than permitted.
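A minimal sketch of sticky back-off (the file name, keys and constants are assumptions for illustration; a globally shared store would replace the local file for the cross-agent case): the retry schedule is persisted, so restarting the client does not reset it.

```python
import json, os, random, time

STATE_FILE = "backoff_state.json"    # hypothetical location; survives client restarts

def next_allowed_attempt(operation, base=2.0, cap=3600.0):
    """Sticky exponential back-off: the schedule is persisted so that a
    restarted client (or another agent retrying the same request) cannot
    circumvent it by starting over."""
    state = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
    entry = state.get(operation, {"failures": 0, "not_before": 0.0})
    now = time.time()
    if now < entry["not_before"]:
        return entry["not_before"]                      # still backing off: fail fast
    entry["failures"] += 1
    delay = min(cap, base ** entry["failures"]) * random.uniform(0.5, 1.5)   # jitter
    entry["not_before"] = now + delay
    state[operation] = entry
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
    return entry["not_before"]

# Each call after a failure pushes the next allowed attempt further out,
# even across process restarts.
print(next_allowed_attempt("reconnect-supernode"))
```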
7 - Scalable fault-tolerant control plane
• Build a 23-million-way Big Red Button: the ability to instantly control the "Flash Crowd".
• Most distributed systems focus on scaling the data plane and assume the control plane is insignificant
• Skype’s control plane for the client relied on relay by the supernodes and was effectively disabled when the cloud disintegrated
   – Backing it up with a simple RSS-style command feed would have made it possible to control the cloud even in its dissipated state.
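A sketch of such a fallback command channel (the URL, feed schema and polling interval are illustrative assumptions): every client polls a simple static feed, so it can still be told to back off or upgrade even when the P2P overlay it normally relies on is gone.

```python
import json, time, urllib.request

FEED_URL = "https://status.example.com/client-commands.json"   # hypothetical feed

def poll_command_feed(apply_command, interval_sec=300):
    """Fallback control plane: poll a simple, cacheable feed of commands
    (e.g. 'back off', 'upgrade') that works even when the overlay is down."""
    last_seen = 0
    while True:
        try:
            with urllib.request.urlopen(FEED_URL, timeout=10) as resp:
                commands = json.load(resp)     # e.g. [{"id": 7, "action": "backoff"}]
            for cmd in sorted(commands, key=lambda c: c["id"]):
                if cmd["id"] > last_seen:
                    apply_command(cmd)
                    last_seen = cmd["id"]
        except (OSError, ValueError):
            pass                               # feed unreachable: keep the last orders
        time.sleep(interval_sec)
```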
8 - Fault injection testing
• Skype's problem originated and became critical in
  the parts of the system that were dealing with
  failures of other components.
• This is very typical – often 50% of the code and
  overall complexity is dedicated to fault handling.
• This code is extremely difficult to test outside
  rudimentary white-box unit tests, so in most cases
  it remains untested.
   – It is almost never regression-tested, making it the most stale and least reliable part of the system
   – It is invoked exactly when the system is already experiencing problems
8 - Fault injection testing framework
• This requires an on-demand fault-injection framework.
• The framework intercepts and controls all communications between components and layers.
• It exposes an API to simulate all conceivable kinds of failures:
   – total and intermittent component outages
   – communication failures
   – SLA violations



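A minimal sketch of such a framework (the class and parameters are assumptions, not a specific product): a test-only proxy wraps a dependency and injects total outages, intermittent errors, and latency-based SLA violations on demand, so the caller's fault-handling paths can actually be exercised.

```python
import random
import time

class FaultInjectingProxy:
    """Test-only wrapper that sits between components and simulates outages,
    intermittent failures and SLA (latency) violations on demand."""

    def __init__(self, target, outage=False, error_rate=0.0, extra_latency_ms=0):
        self.target = target
        self.outage = outage                      # total component outage
        self.error_rate = error_rate              # intermittent failures
        self.extra_latency_ms = extra_latency_ms  # SLA violation

    def __getattr__(self, name):
        real = getattr(self.target, name)

        def wrapped(*args, **kwargs):
            if self.outage:
                raise ConnectionError(f"{name}: injected total outage")
            if random.random() < self.error_rate:
                raise TimeoutError(f"{name}: injected intermittent failure")
            if self.extra_latency_ms:
                time.sleep(self.extra_latency_ms / 1000.0)
            return real(*args, **kwargs)

        return wrapped

# Usage sketch: wrap a stand-in dependency and make 30% of calls fail
class OfflineMessageStore:
    def fetch(self, user):
        return []

flaky_store = FaultInjectingProxy(OfflineMessageStore(), error_rate=0.3)
```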
9 - Meaningful failure messages
• Throughout the event, both the client and the central site were reporting assorted problems that were often unrelated to what was actually happening:
   – e.g. at some point my Skype client complained that my credentials were incorrect
• Leverage crowdsourcing:
   – clear and relevant error messages
   – easy ways for users to report problems
   – real-time aggregation
   – a secondary monitoring and alerting network

10 - Communication to the end users
• Efficient, timely and honest communication to the
  end users is the only way to run Dependable
  Systems
• Dedicated Status site for humans
• Status APIs and Feeds




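As a sketch of the status-API idea (the JSON shape below is illustrative, not a standard or Skype's actual feed), the same data can drive both the human-facing status site and machine consumers:

```python
import json
import time

def render_status(components):
    """Minimal machine-readable status feed: the same data backs the
    human-facing status site and any automated consumers."""
    return json.dumps({
        "updated": int(time.time()),
        "components": [
            {"name": name, "state": state}     # e.g. "ok", "degraded", "outage"
            for name, state in components.items()
        ],
    }, indent=2)

print(render_status({"login": "ok", "offline-messaging": "degraded", "calls": "ok"}))
```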
11 - Separate command, control and
recovery infrastructure
• Have a physically separate and logically
  independent emergency command, control and
  recovery infrastructure
• Catalog the technologies, systems and dependencies used to detect faults, to build, deploy and (re)start applications, and to communicate status
• Avoid circular dependencies
• Use separate implementations or old stable
  versions

11 - Separate command, control and
recovery infrastructure - blueprint
[Diagram: All Other Systems (the Data Cloud and Binary Content) are managed and maintained by the Command, Control and Recovery Infrastructure (Monitoring & Control Systems and Build & Deploy Systems), which in turn uses a Standalone Data Cloud and a Standalone Control & Recovery Stack.]
Cloud Outage Redux
On April 21st, 2011, an operator error during maintenance caused an outage of the Amazon Elastic Block Store (“EBS”) service, which in turn brought down parts of the Amazon Elastic Compute Cloud (“EC2”) and the Amazon Relational Database Service (RDS). The outage lasted 54 hours, data recovery took several more days, and 0.07% of the EBS volumes were lost.

The outage affected over 100 sites, including big names like Reddit, Foursquare and Moby.

EC2 Outage Sequence
• An operator error occurs during a routine maintenance operation
• Production traffic is routed onto a low-capacity backup network
• A split brain occurs
• When the network is restored, the system enters a “re-mirroring storm”
• The EBS control plane is overwhelmed
• Dependent services start failing
Applicable Lessons
1.      Pervasive Monitoring
2.      Early warning systems
3.      Graceful degradation
4.      Contagious failure awareness
5.      Design for failures
6.      Fail-fast and exponential back-off
7.      Scalable and fault-tolerant control plane
8.      Fault injection testing
9.      Meaningful failure messages
10. Efficient, timely and honest communication to the end users
11. Separate command, control and recovery infrastructure

Further Reading and contact Information
Skype Outage Postmortem - http://blogs.skype.com/en/2010/12/cio_update.html
Amazon Outage Postmortem - http://aws.amazon.com/message/65648/
Recovery-Oriented Computing - http://roc.cs.berkeley.edu/roc_overview.html
Designing and Deploying Internet-Scale Services, James Hamilton - http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa_DesigningServices.pptx
Design Patterns for Graceful Degradation, Titos Saridakis - http://www.springerlink.com/content/m7452413022t53w1/



Alex Maclinovsky blogs at http://randomfour.com/blog/
and can be reached via:
 alexey.v.maclinovsky@accenture.com
Questions & Answers