Cloud API Issues: an Empirical Study and Impact

Quality of Software Architecture (QoSA) 2013 talk slides. June 18th, 2013.
Full paper at http://www.nicta.com.au/pub?doc=6785

Published in: Technology, Business

  1. NICTA Copyright 2012. From imagination to impact.
     Cloud API Issues: an Empirical Study and Impact
     Qinghua Lu, Liming Zhu, Len Bass, Xiwei Xu, Zhanwen Li, Hiroshi Wada
     Software Systems Research Group, NICTA
     QoSA13, Vancouver
     Slides at: http://www.slideshare.net/LimingZhu/
  2. Motivation
     • Cloud applications fail due to operation issues
       – Gartner reports: 80% of outages are caused by operations
         • People/process: replication/failover, auto-scaling, upgrade…
       – Lessons from our own cloud DR product: Yuruware.com
       – DevOps movement
     • Operational causes of failures
       – Infrastructure and processes, but
       – Most things are done through the infrastructure API
     • Highly dependable cloud applications require
       – Architecting for not just the software but also its operation (through the API)
       – Architecting for indirect control (through the API)
       – Better understanding of cloud API issues
         • Reliability, performance, nature of failures and faults
  3. Main Contributions
     • Empirical study of cloud infrastructure API issues
       – 922 failure/fault cases from Amazon EC2 forums (2010 to 2012)
         • Built around five of the most used API calls
         • Fault analysis supplemented by other sources
       – Classified the API failures and faults (causes of failures)
         • Using the classic dependable computing taxonomy (Avizienis, 04)
         • Failures: content, late timing, halt, erratic
         • Faults: development, physical, interaction
     • Impact analysis through an initial proposal for tolerating cloud API failures/faults
       – Suggestions for tolerating content failures
       – 11 patterns for tolerating timing failures
  4. Some Empirical Findings
     • The majority (60%) of API failure cases are related to stuck or unresponsive API calls
     • 19% of the cases are related to output issues of API calls
       – Error messages, missing/wrong/unexpected content
     • 12% of the cases are about slow-responding API calls
     • 9% of the cases are related to API calls that
       – were pending for a certain time and then returned to the original state without informing the caller properly
       – were reported as successful first but failed later
  5. Methodology
     • Amazon: EC2 forums and outage reports
     • Netflix: technical blogs and GitHub OSS projects
     • Yuruware.com: disaster recovery product which relies heavily on cloud infrastructure APIs
  6. Data Collected from Amazon EC2 Forum

     Searched keywords and number of returned records

     API of interest        Records from inception to 2012    Records from 2010 to 2012
     describe instance      283                               150
     start instance         227                               204
     stop instance          349                               348
     detach volume          235                               203
     associate elastic IP   264                               204
     Total                  1358                              1109

     Case type, number and percentage in the found cases

     Case type          Number of cases    Percentage of all cases from 2010 to 2012
     API failures       922                83%
     Enquiries          125                11%
     API enhancements   62                 6%
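The percentages in the case-type table can be rechecked from the raw counts. The short sketch below only redoes the arithmetic with the numbers transcribed from the slide; the variable and function names are illustrative.

```python
# Counts transcribed from the slide's case-type table (records from 2010 to 2012).
case_counts = {
    "API failures": 922,
    "Enquiries": 125,
    "API enhancements": 62,
}

# The three case types should add up to the 1109 records found for 2010 to 2012.
total = sum(case_counts.values())


def share(count: int, whole: int) -> int:
    """Return `count` as a percentage of `whole`, rounded to a whole percent."""
    return round(100 * count / whole)


shares = {name: share(count, total) for name, count in case_counts.items()}

for name, pct in shares.items():
    print(f"{name}: {case_counts[name]} of {total} = {pct}%")
```

The rounded shares agree with the 83%, 11% and 6% reported on the slide.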
  7. Classification of API Failures
     Fault -> Error -> Failure
     • Failure: deviation from correct service (externally visible)
     • Error: internal erroneous state
     • Fault: adjudicated or hypothesized cause of a failure
     [13] A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr, "Basic concepts and taxonomy of dependable and secure computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, pp. 11-33, 2004.
  8. Classification of API Failures
     • Content failures (19%)
       – With error messages; missing/wrong/unexpected content
         • 61% of the time users understood the causes/solutions from the error message
         • 39% of the time users could not pinpoint the causes from the error message

     Failed call where the error message is unclear.
     Posted on Jan 10, 2012 5:42 AM
     Symptom: When a user tried to start an instance, the operation failed with an unclear error message.
     Error message: State Transition Reason - Server.InternalError: Internal error on launch
     Root cause: Unknown.
     Solution: AWS engineers advised detaching the EBS volume from the instance and attaching it to another running instance.

     Failed call where the error message is clear.
     Posted on Jun 14, 2012 9:57 PM
     Symptom: Failed API calls and receiving Request limit exceeded error message.
     Error message: Client.RequestLimitExceeded: Request limit exceeded
     Root cause: API calls exceeded limit.
     Solution: N/A. There is no official information on the limit, the time span over which the limit is calculated, or the suggested wait time.
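Since the second post notes that no official limit or wait time is documented, one common client-side response to a clear throttling error is to retry with capped exponential backoff and jitter. The sketch below is an illustration, not something taken from the slides: `ApiError`, `call_with_backoff`, and the delay values are all hypothetical, and only the error code string comes from the forum post.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ApiError(Exception):
    """Hypothetical error type carrying the error code returned by the API."""

    def __init__(self, code: str, message: str = "") -> None:
        super().__init__(f"{code}: {message}")
        self.code = code


# Error codes treated as throttling; the value is the one quoted in the post.
THROTTLE_CODES = {"Client.RequestLimitExceeded"}


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,   # assumed starting delay in seconds
    max_delay: float = 30.0,   # assumed cap in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on throttling errors; re-raise anything else at once."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except ApiError as err:
            is_last_attempt = attempt == max_attempts - 1
            if err.code not in THROTTLE_CODES or is_last_attempt:
                raise
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

An unclear error such as `Server.InternalError` is deliberately not retried here, because the caller cannot tell from the message whether repeating the call is safe.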
  9. Classification of API Failures
     • Late timing failures (12%)
       – The arrival time of the delivered information deviates from the expected time, but the information does eventually arrive

     A late timing failure example.
     Posted on Aug 27, 2012 11:57 AM
     Symptom: It took 16 minutes for an instance to stop.
     Root cause: n/a.
     Solution: The AWS engineer advised trying "force stop" twice if this happens next time.
  10. Classification of API Failures
      • Halt failures (60%)
        – The external state becomes constant.
        – Most frequent failures!

      A general halt failure example.
      Posted on Jun 27, 2012 12:04 AM
      Symptom: A user reported that the instance was stuck at stopping and "force stop" would not help.
      Root cause: n/a.
      Solution: The AWS engineer stopped the instance for the user on the AWS side (with some side effect).

      A silent failure example.
      Posted on Oct 23, 2012 7:45 AM
      Symptom: An instance was not accessible and the user could not stop/start it or create a snapshot.
      Root cause: AWS outage.
      Solution: The AWS engineer advised that the user must launch a replacement instance from a pre-existing backup (EBS AMI). Attempts to stop an inaccessible instance will likely result in an instance becoming stuck in the stopping state. Customers that do not have a known good backup must wait for the issue to be resolved for their instance connectivity to be restored.
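From the consumer side, a halt failure shows up as a state that never changes, so it can only be suspected after a deadline passes. The sketch below polls a state until a target is reached or a deadline expires. `get_state`, `HaltSuspected`, and the timing parameters are hypothetical names chosen for illustration; a real caller would wrap a "describe" style API call and pick a deadline from its own observations.

```python
import time
from typing import Callable


class HaltSuspected(Exception):
    """Raised when the target state has not been observed by the deadline."""


def wait_for_state(
    get_state: Callable[[], str],   # hypothetical wrapper around a describe call
    target: str,
    deadline_s: float,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until `target` is observed and return the elapsed seconds."""
    start = clock()
    while True:
        state = get_state()
        elapsed = clock() - start
        if state == target:
            return elapsed
        if elapsed >= deadline_s:
            # The call may still complete later; the caller decides what to do next.
            raise HaltSuspected(
                f"still {state!r} after {elapsed:.0f}s; expected {target!r}"
            )
        sleep(poll_s)
```

The returned elapsed time can also be logged to separate late timing failures (slow but complete) from suspected halts (deadline exceeded).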
  11. Classification of API Failures
      • Erratic failures (9%)
        – The delivered service is unpredictable. Two subtypes:
          • the call is pending for a certain time and then returns to the original state
          • the call is successfully executed first but fails eventually

      Two erratic failure examples.
      Posted on Feb 1, 2012 8:15 AM
      Symptom: A user associated an elastic IP with an instance and could SSH into the instance with the elastic IP. After a few minutes, the elastic IP was silently disassociated from the instance.
      Root cause: An issue with the underlying host.
      Solution: The AWS engineer advised that the quickest fix was to stop and then start the instance to relocate it to a different host.

      Posted on Jan 14, 2011 1:43 PM
      Symptom: A user tried to start the instance several times. The status showed pending and then went back to stopped.
      Root cause: n/a.
      Solution: The AWS engineer returned the user's EBS volume to the available state and believed this would resolve the user's problem.
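Because an erratic failure can follow an apparently successful call, one defensive option is to keep checking the expected condition for a while before trusting it. The sketch below is an assumption-laden illustration: `confirm_stable`, `check`, and the window length are hypothetical, and `check` stands in for something like "the elastic IP is still associated with the instance".

```python
import time
from typing import Callable


def confirm_stable(
    check: Callable[[], bool],      # hypothetical condition probe
    window_s: float,
    poll_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if `check()` holds on every poll across the window."""
    start = clock()
    while True:
        if not check():
            # The earlier success did not last: treat the call as failed.
            return False
        if clock() - start >= window_s:
            return True
        sleep(poll_s)
```

The window trades detection against latency: the first forum example above reverted after a few minutes, so a window of only a few seconds would likely have missed it.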
  12. Classification of Faults (Causes of Failures)
      • Development faults: software bugs
        – User workarounds exist but may break after bug fixing
      • Physical faults
        – Stopping/starting to move to a new physical machine, but the stopping itself is problematic
        – Future work: classifying using virtual resource characteristics
      • Interaction faults
        – Misconfiguration faults account for 30%
          • Accidental and purposeful misconfiguration
        – Purposeful misconfiguration
          • Lack of knowledge (subjective uncertainty vs. stochastic uncertainty)
          • Configuration and operation impact on availability [1, 2]
      1. X. Xu, Q. Lu, L. Zhu, et al., "Availability Analysis of In-Cloud Applications," in ISARCS13 (11:30 tomorrow)
      2. Q. Lu, X. Xu, L. Zhu, L. Bass, et al., "Incorporating Uncertainty into in-Cloud Application Deployment Decisions for Availability," in IEEE Cloud 2013
  13. Tolerating API Failures/Faults
      • Perspective
        – Cloud consumer and application oriented
        – Limited visibility: e.g. may not know the root cause
        – Indirect control: e.g. solutions are through APIs as well
      • Different failures/faults require different approaches
        – Failure/fault classification dependent
        – Suggestions, patterns and ad-hoc use of failure/fault characteristics:
          • Content failures: alternative sources for content, defensive programming…
          • Late timing failures: API call life cycle driven
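The two suggestions for content failures (defensive programming and alternative sources for content) can be combined in a small sketch. Everything here is an assumption made for illustration: the response shape `{"state": ...}`, the function names, and the idea that each source is a zero-argument callable. The set of state names is the standard EC2 instance life cycle.

```python
from typing import Callable, Iterable, Mapping, Optional

# EC2 instance life cycle states; anything else is treated as unexpected content.
VALID_STATES = {
    "pending", "running", "stopping", "stopped", "shutting-down", "terminated",
}


def valid_state(response: object) -> Optional[str]:
    """Return the instance state if the response is well formed, else None."""
    if not isinstance(response, Mapping):
        return None
    state = response.get("state")   # hypothetical response field
    if isinstance(state, str) and state in VALID_STATES:
        return state
    return None


def first_valid_state(sources: Iterable[Callable[[], object]]) -> Optional[str]:
    """Try each source in order and return the first well-formed state."""
    for source in sources:
        try:
            state = valid_state(source())
        except Exception:
            # An error response counts as a content failure: try the next source.
            continue
        if state is not None:
            return state
    return None
```

A source could be the primary API endpoint, a cached earlier response, or a monitoring feed; which alternatives exist depends on the application.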
  14. API Call Life Cycle Driven Patterns
  15. Pattern Examples
      • Faster forced fail/complete
        – force-fail-r or force-fail-s
          • Netflix Hystrix: fail fast based on 95th to 99th percentile delay
        – force-complete-r
          • Yuruware: ignore some "describe" API calls
      • Hedged requests or more sophisticated retry
        – continue-request
          • Common: send the same request to 2 places and cancel the slow one
        – reallocate or reallocate-s
          • Yuruware: attach the to-be-moved volume to different mover instances after early mover failures
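The hedged request idea ("send the same request to 2 places and cancel the slow one") can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration under stated assumptions, not the implementation used by any of the systems named on the slide: `hedged_call`, `primary`, `backup`, and `hedge_after_s` are hypothetical names, and the request is assumed to be safe to issue twice.

```python
import concurrent.futures as cf
from typing import Callable, List, TypeVar

T = TypeVar("T")


def hedged_call(
    primary: Callable[[], T],
    backup: Callable[[], T],
    hedge_after_s: float,
) -> T:
    """Run `primary`; if it is slow or fails, also run `backup`.

    Returns the first successful result. Raises the last error if both fail.
    """
    pool = cf.ThreadPoolExecutor(max_workers=2)
    try:
        futures: List[cf.Future] = [pool.submit(primary)]
        done, _ = cf.wait(futures, timeout=hedge_after_s)
        # Hedge when the primary is still running or has already failed.
        if not done or futures[0].exception() is not None:
            futures.append(pool.submit(backup))

        errors: List[BaseException] = []
        pending = set(futures)
        while pending:
            done, pending = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
            for future in done:
                error = future.exception()
                if error is None:
                    return future.result()
                errors.append(error)
        raise errors[-1]
    finally:
        # Do not wait for the slower call. A thread that is already running
        # cannot be interrupted, so it finishes in the background and its
        # result is discarded (cancel_futures needs Python 3.9 or later).
        pool.shutdown(wait=False, cancel_futures=True)
```

A natural choice for `hedge_after_s` is a high percentile of observed latency, in the spirit of the 95th to 99th percentile delay mentioned for the fail-fast pattern, so that the second request is only sent for the slowest few calls.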
  16. Conclusion and Future Work
      • Empirical study of cloud infrastructure API issues
        – Analysed and classified 922 failures/faults from Amazon EC2 forums
          • Inform better architecting for operations (i.e. operator as a stakeholder)
        – Future work (completed)
          • Expanded to more cases from other sources (2087 issues)
          • Proposed a new scheme for classifying faults
      • Tolerating cloud API failures/faults
        – Patterns for tolerating different types of API failures/faults
        – Future work (ongoing)
          • More actionable mechanisms/patterns and their implementation
          • Use the characteristics of the faults and failures for smarter recovery and error diagnosis during operation
      • What we need: more real world operation logs and collaborators
      {Liming.Zhu, Len.Bass}@nicta.com.au
      Slides available at http://www.slideshare.net/LimingZhu/
