Andreas Grabner maintains that most performance and scalability problems don’t need a large or long-running performance test, or the expertise of a performance engineering guru. Don’t let anybody tell you that performance is too hard to practice, because it isn’t. You can take the initiative and find these often serious defects. Andreas analyzed and spotted the performance and scalability issues in more than 200 applications last year. He shares his performance testing approaches and explores the top problem patterns that you can learn to spot in your own apps. By looking at key metrics found in log files and performance monitoring data, you will learn to identify most problems with a single functional test and a simple five-user load test. The problem patterns Andreas explains are applicable to any type of technology and platform. Try out your new skills in your current testing project and take the first step toward becoming a performance diagnostic hero.
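As a minimal sketch of the kind of five-user load test the abstract describes (this is not Andreas's tooling; the endpoint call is simulated with a sleep and would be replaced by a real HTTP request in practice):

```python
import concurrent.futures
import random
import statistics
import time

def call_endpoint():
    """Stand-in for a real HTTP call; replace with e.g. requests.get(url)."""
    time.sleep(random.uniform(0.01, 0.05))  # simulated response time
    return 200

def load_test(users=5, requests_per_user=10):
    """Run `users` concurrent workers and collect per-request latencies."""
    def worker():
        latencies = []
        for _ in range(requests_per_user):
            start = time.perf_counter()
            call_endpoint()
            latencies.append(time.perf_counter() - start)
        return latencies

    with concurrent.futures.ThreadPoolExecutor(max_workers=users) as pool:
        futures = [pool.submit(worker) for _ in range(users)]
        results = [f.result() for f in futures]

    all_latencies = [t for worker_times in results for t in worker_times]
    return {
        "requests": len(all_latencies),
        "avg_s": statistics.mean(all_latencies),
        "max_s": max(all_latencies),
    }

summary = load_test()
print(summary["requests"])  # 50
```

Even a toy harness like this surfaces the key metrics the talk focuses on: average and worst-case latency under light concurrency, which is often enough to expose N+1 query patterns and synchronization bottlenecks.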
Embracing Failure - Fault Injection and Service Resilience at Netflix (Josh Evans)
A presentation given at AWS re:Invent on how Netflix induces failure to validate and harden production systems. Technologies discussed include the Simian Army (Chaos Monkey, Gorilla, Kong) and our next gen Failure Injection Test framework (FIT).
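A toy illustration of the fault-injection principle behind tools like Chaos Monkey and FIT (not Netflix's actual code; the recommendation function and fallback are invented): a wrapper injects failures at a configured rate, and the caller proves its resilience by degrading to a static fallback.

```python
import random

class FaultInjector:
    """Wraps a callable and injects failures at a configured rate."""
    def __init__(self, func, failure_rate=0.2, seed=None):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise RuntimeError("injected fault")
        return self.func(*args, **kwargs)

def get_recommendations(user_id):
    """Hypothetical downstream service call."""
    return ["title-%d" % i for i in range(3)]

def resilient_call(injected, user_id, fallback=("popular-title",)):
    """Caller-side resilience: degrade to a static fallback on failure."""
    try:
        return injected(user_id)
    except RuntimeError:
        return list(fallback)

flaky = FaultInjector(get_recommendations, failure_rate=1.0)  # always fail
print(resilient_call(flaky, 42))  # ['popular-title']
```

The point of running such injection in production, as the talk argues, is that the fallback path is exercised continuously rather than discovered broken during a real outage.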
Engineering Netflix Global Operations in the Cloud (Josh Evans)
Delivered at re:Invent 2015.
Operating a massively scalable, constantly changing, distributed global service is a daunting task. We innovate at breakneck speed to attract new customers and stay ahead of the competition. This means more features, more experiments, more deployments, more engineers making changes in production environments, and ever-increasing complexity. Simultaneously improving service availability and accelerating rate of change seems impossible on the surface. At Netflix, operations engineering is both a technical and organizational construct designed to accomplish just that by integrating disciplines like continuous delivery, fault injection, regional traffic management, crisis response, best practice automation, and real-time analytics. In this talk, designed for technical leaders seeking a path to operational excellence, we'll explore these disciplines in depth and how they integrate and create competitive advantages.
Application Security Assessments by the Numbers - A Whole-istic View - OWASP ... (Denim Group)
By analyzing the data from over 60 mobile application security assessments, we identify the typical types of mobile vulnerabilities, the system components that contain those vulnerabilities, the components where given types of vulnerabilities cluster, and how to test for each of these.
In this session, attendees will learn how to identify these vulnerabilities, how to create and implement an effective mobile security plan, and where to focus their limited testing resources to minimize mobile application portfolio risks. This is critical because automated web application testing tools can easily find vulnerabilities, while today's mobile security industry does not offer automated testing tools that can effectively test web services (i.e., the interaction between mobile clients and back-end services). As a result, best practices for mobile application testing must incorporate significant, often laborious, manual testing. At this point in the presentation, we will use the statistics from the research to define the appropriate manual testing that needs to be implemented.
A webinar hosted by Curiosity Software Ireland on November 10th, 2020. Watch the on-demand recording here: https://opentestingplatform.curiositysoftware.ie/broken-promise-test-automation-webinar
Let’s face it - in this day and age you can’t test everything. Our environments have become too complex, while the allocated time for the construction and execution of tests is shorter than ever. With these constraints, two pressing questions arise:
• Are we building tests that truly matter for the release?
• Are we optimizing our regression suites in light of the changing application?
Believe it or not, this webinar is NOT a pitch about how AI or ML will magically solve your problems. Instead, Huw Price, Managing Director of Curiosity, and Daniel Howard, Senior Researcher at Bloor Research, will offer a definitive plan for evolving sustainable automation. They will map new and emerging techniques for achieving in-sprint testing, including:
1. Automation that extends far beyond test execution;
2. Optimisation techniques for targeting testing exactly where it’s needed;
3. Methods for capitalising on data created by integrated DevOps toolchains.
Curiosity Director of Technology, James Walker, will be on hand to provide demos of the key technologies identified by Huw and Daniel. You will come away with actionable guidance for optimizing your testing, while tackling the time-intensive processes that test automation has introduced.
Join Huw, Daniel and James to see how testing can move beyond hand-cranking tests!
Curiosity and Sauce Labs present - When to stop testing: 3 dimensions of test... (Curiosity Software Ireland)
This webinar was co-hosted by Curiosity Software and Sauce Labs on the 28th of September, 2021. Watch the webinar on demand today: https://opentestingplatform.curiositysoftware.ie/stop-testing-test-coverage-webinar
A definition of “done” is one of the hardest and most valuable things to come by in testing. Faced with fast-changing, massively complex systems, there’s no time to test everything in short sprints. Even defining “everything” is hard enough, given the vast and often unknown system logic, user devices, and integrated technologies that must be factored into rigorous testing. Too often, a lack of measurability combines with unsystematic test design, forcing testers to guess or hope that testing is “done”. This introduces uncertainty with every rapid release. Tests leave logic exposed to costly bugs and performance issues, while untested devices warp UIs and user experiences.
This webinar will set out how testing can rapidly identify, generate, and run the tests needed to de-risk rapid software releases. It will define functional test coverage in three dimensions, considering the system logic and data that must be tested, the optimal device mix, and the need to test across different system tiers. James Walker, Curiosity’s Director of Technology, and Marcus Merrell, Senior Director of Technology Strategy at Sauce Labs, will then demonstrate how in-sprint testing can target tests based on this multifaceted measure. You will see how:
1. Generating optimised tests, data and scripts from visual flowcharts avoids slow test creation and maintenance, while testing system logic rigorously based on time and risk.
2. Pushing tests to cloud-based device labs minimises environment and device limitations, enabling the right mix for each stage of the testing lifecycle.
3. Updating central flows regenerates tests in-sprint, targeting impacted and risky logic across APIs, UIs and back-end systems.
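As a rough illustration of point 1 above, generating tests from a visual flowchart can be thought of as enumerating paths through a graph, with one test case per path. This sketch is not Curiosity's actual engine, and the login flow it models is invented:

```python
def generate_paths(flow, node="start", path=None):
    """Enumerate all start-to-end paths through a flowchart (a DAG),
    yielding one generated test case per path."""
    path = (path or []) + [node]
    successors = flow.get(node, [])
    if not successors:          # terminal node: one complete test case
        return [path]
    paths = []
    for nxt in successors:
        paths.extend(generate_paths(flow, nxt, path))
    return paths

# Hypothetical login flow: credentials entry branches into valid/invalid.
flow = {
    "start": ["enter_credentials"],
    "enter_credentials": ["valid_login", "invalid_login"],
    "valid_login": ["dashboard"],
    "invalid_login": ["error_message"],
}
tests = generate_paths(flow)
print(len(tests))  # 2
```

Updating the central flow (adding a branch, say, for a locked account) and re-running the generator is what makes regeneration in-sprint possible: the test set follows the model rather than being maintained by hand.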
Do You Have a Scanner or Do You Have a Scanning Program? (AppSecEU 2013, Denim Group)
By this point, most organizations have acquired at least one code or application scanning technology to incorporate into their software security program. Unfortunately, for many organizations the scanner represents the entirety of that so-called “program” and often the scanners are not used correctly or on a consistent basis.
This presentation looks at the components of a comprehensive software security program, the role that automation plays in these programs and tools and techniques that can be used to help increase the value an organization receives from its application scanning activities. It starts by examining common traps organizations fall into where they fail to address coverage concerns – either breadth of scanning coverage across the application portfolio or depth of coverage issues where application scans do not provide sufficient insight into the security state of target applications. After discussing approaches to address these coverage issues, the presentation walks through metrics organizations can use to keep tabs on their scanning progress to better understand what is being scanned, how frequently and at what depth.
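The breadth, freshness, and depth metrics the presentation discusses can be sketched as a simple computation over an application inventory. The inventory, field names, and thresholds below are invented for illustration:

```python
from datetime import date

# Hypothetical application inventory with last-scan records.
portfolio = [
    {"app": "billing",  "last_scan": date(2013, 5, 1),  "scan_depth": "authenticated"},
    {"app": "intranet", "last_scan": None,              "scan_depth": None},
    {"app": "portal",   "last_scan": date(2013, 1, 15), "scan_depth": "unauthenticated"},
]

def coverage_metrics(apps, today=date(2013, 6, 1), stale_days=90):
    """Breadth: ever scanned. Freshness: scanned recently.
    Depth: scanned with authentication (deeper application insight)."""
    scanned = [a for a in apps if a["last_scan"] is not None]
    fresh = [a for a in scanned if (today - a["last_scan"]).days <= stale_days]
    deep = [a for a in scanned if a["scan_depth"] == "authenticated"]
    total = len(apps)
    return {
        "breadth_pct": 100.0 * len(scanned) / total,
        "fresh_pct": 100.0 * len(fresh) / total,
        "depth_pct": 100.0 * len(deep) / total,
    }

m = coverage_metrics(portfolio)
```

Tracking numbers like these over time is what turns "we own a scanner" into a scanning program: the gap between breadth and depth percentages shows exactly where scans exist but provide little insight.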
The presentation also contains a demonstration of how freely available tools such as the open source ThreadFix application vulnerability management platform and the OWASP Zed Attack Proxy (ZAP) scanner can be combined to create a baseline scanning program for an organization and how this approach can be generalized to use any scanning technology.
This webinar offers a window into IBM Requirements Quality Assistant.
We showed how RQA helps us improve the quality of requirements, and consequently of the applications we release.
We also saw how IBM RQA can support us when using tools such as IBM Jazz.
The webinar is led by Profesia and a product expert. Request a personalized demo by writing to sales@profesia.it
*** Watch the on demand webinar recording here - https://curiositysoftware.ie/resources/test-data-development-webinar/ ***
A Curiosity Software and Windocks webinar, presented live on the 2nd of February, 2021. Now available to stream on demand!
Test data “provisioning” is lagging far behind the sophistication of today’s systems. Development has shifted to containerisation and microservices, rapidly ripping out and replacing reusable components. Testers must also rapidly rip-and-replace versioned components in their environments, while retaining complex data relationships between shifting technologies. The deployed data must furthermore be diverse, compliant and compact, fulfilling all positive and negative scenarios in the shortest test runs possible.
Sound like an impossible requirement? It is, if you rely on making costly physical copies of low-variety production data. “Test data management” instead needs to embrace the world of containers and APIs, along with the pipelines that enable developers to deliver so rapidly. We need a new approach to testing massively complex systems in short sprints.
This webinar will showcase how Test Data Automation combines with containerised data cloning, automatically deploying versioned virtual databases as tests are created and run. Huw Price, Managing Director of Curiosity Software Ireland, and Paul Stanton, co-founder and Vice President of Windocks, will show you how:
1. Test Data Automation provides complete and compliant data on demand, delivering test-ready data that is masked and enhanced with synthetic data.
2. Parallel test teams and frameworks leverage fresh containers, without slow data provisioning or complex configuration.
3. Organisations regain full visibility and control over test data, while enjoying the added affordability of database virtualisation.
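As a toy illustration of point 1, masking data while preserving referential integrity might look like the sketch below. This is not Test Data Automation's actual implementation; the record fields are invented:

```python
import hashlib
import random

def mask_record(record, seed=0):
    """Masking sketch: deterministic pseudonyms for identifiers,
    format-preserving fakes for contact details."""
    rng = random.Random(seed)
    masked = dict(record)
    # Deterministic pseudonym: the same input always maps to the same
    # token, so joins across tables still line up after masking.
    masked["customer_id"] = hashlib.sha256(
        record["customer_id"].encode()).hexdigest()[:8]
    masked["name"] = "Customer-" + masked["customer_id"]
    # Format-preserving fake: same length and shape as the real number.
    masked["phone"] = "".join(rng.choice("0123456789") for _ in record["phone"])
    return masked

original = {"customer_id": "C1001", "name": "Ada Lovelace", "phone": "0851234567"}
safe = mask_record(original)
```

Determinism is the important design choice here: masking the same production identifier twice, in two different tables or on two different days, yields the same pseudonym, which is what keeps complex data relationships intact across shifting components.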
Given the changing nature of enterprise networking, Riverbed decided to survey attendees of the recent VMworld conference about their companies’ current plans for these emerging technologies. Riverbed surveyed 260 attendees face-to-face, from a variety of roles and with a median company size of 2,300 employees.
Customers and employees complaining about poor network performance or application delays? Want to put an end to the whining? Learn how combining visibility with WAN optimization delivers optimal performance for customers and employees regardless of location by watching this webinar from Riverbed. http://rvbd.ly/1OVbaQw
Is your company thinking about using Selenium to implement test automation in a joint development and operations environment? If your company has already started using Selenium, have you experienced execution or integration challenges? The path to a well-oiled and successful Selenium test automation program comes down to using the right techniques and development standards that incorporate modularity and flexibility. Jin Reck describes how to design effective web test automation development, and shares common challenges and solutions when implementing an automated testing framework in the real world. Jin shows how to incorporate Selenium with continuous integration platforms and discusses techniques, adjustments, lessons learned, and best practices from successful implementations. Leave with a better understanding of how to design and employ Selenium to create robust and reliable automated tests that increase the efficiency and productivity of test teams and make for a capable and successful testing program.
Static Application Security Testing Strategies for Automation and Continuous ... (Kevin Fealey)
Static Application Security Testing (SAST) introduces challenges for existing software development lifecycle configurations. Strategies applied at different points of the SDLC improve deployment time while still improving the quality and security of the deliverable. This session will discuss the different strategies that can be implemented for SAST within the SDLC: strategies catering to developers versus security analysts versus release engineers. The strategies consider the challenges each team may encounter, allowing them to incorporate security testing without jeopardizing deadlines or existing processes.
Remediation Statistics: What Does Fixing Application Vulnerabilities Cost? (Denim Group)
For the security industry to mature, more data needs to be available about the true cost of security vulnerabilities. Data and statistics are starting to be released, but most of this currently focuses on the prevalence of different types of vulnerabilities and incidents rather than the costs of addressing the underlying issues. This session presents statistics from the remediation of 15 web-based applications in order to provide insight into the actual cost of remediating application-level vulnerabilities.
The presentation begins by setting out a structured model for software security remediation projects so that time spent on tasks can be consistently tracked. It lays out possible sources of bias in the underlying data to allow for better-informed consumption of the final analysis. It also discusses different approaches to remediating vulnerabilities, such as fixing easy vulnerabilities first versus fixing serious vulnerabilities first.
Next, historical data from the fifteen remediation projects is presented. This data consists of the average cost to remediate specific classes of vulnerabilities – cross-site scripting, SQL injection and so on – as well as the overall project composition to demonstrate the percentage of time spent on actual fixes as well as the percentages of time spent on other supporting activities such as environment setup, testing and verification and deployment. The data on the remediation of specific vulnerabilities allows for a comparison of the relative difficulty of remediating different vulnerability types. The data on the overall project composition can be used to determine the relative “efficiency” of different projects.
Finally, analysis of the data is used to create a model for estimating remediation projects so that organizations can create realistic estimates in order to make informed remediate/do not remediate decisions. In addition, characteristics of the analyzed projects are mapped to project composition to demonstrate best practices that can be used to decrease the cost of future remediation efforts.
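The estimation model described above can be illustrated with a toy calculation: per-class averages derived from historical fix tasks, plus an overhead term for supporting activities. The hours and overhead figures below are invented, not the presentation's data:

```python
from collections import defaultdict

# Hypothetical remediation time log (hours) from past fix tasks.
fixes = [
    {"type": "xss",  "hours": 1.5},
    {"type": "xss",  "hours": 2.0},
    {"type": "sqli", "hours": 4.0},
    {"type": "sqli", "hours": 6.0},
    {"type": "xss",  "hours": 2.5},
]
overhead_hours = 20.0  # environment setup, testing/verification, deployment

def estimate(counts, history, overhead):
    """Estimate a remediation project: per-class average cost times
    finding count, plus a project-level overhead term."""
    totals, ns = defaultdict(float), defaultdict(int)
    for f in history:
        totals[f["type"]] += f["hours"]
        ns[f["type"]] += 1
    avgs = {t: totals[t] / ns[t] for t in totals}
    fix_hours = sum(avgs[t] * n for t, n in counts.items())
    return fix_hours + overhead

# A project with 10 XSS and 3 SQL injection findings:
print(estimate({"xss": 10, "sqli": 3}, fixes, overhead_hours))  # 55.0
```

Even this crude model captures the session's key insight: a realistic estimate must price the overhead activities separately, because they can rival the cost of the fixes themselves.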
Taking AppSec to 11 - BSides Austin 2016 (Matt Tesauro)
Curious how DevOps, Agile and CI/CD ideas can speed up your AppSec program? Here's how it can be done, with an example where it led to a 5x speed/flow improvement.
Building Your Application Security Data Hub - OWASP AppSecUSA (Denim Group)
One of the reasons application security is so challenging to address is that it spans multiple teams within an organization. Development teams build software, security testing teams find vulnerabilities, security operations staff manage applications in production and IT audit organizations make sure that the resulting software meets compliance and governance requirements. In addition, each team has a different toolbox they use to meet their goals, ranging from scanning tools, defect trackers, Integrated Development Environments (IDEs), WAFs and GRC systems. Unfortunately, in most organizations the interactions between these teams is often strained and the flow of data between these disparate tools and systems is non-existent or tediously implemented manually.
In today’s presentation, we will demonstrate how leading organizations are breaking down these barriers between teams and better integrating their disparate tools to enable the flow of application security data between silos to accelerate and simplify their remediation efforts. At the same time, we will show how to collect the proper data to measure the performance and illustrate the improvement of the software security program. The challenges that need to be overcome to enable teams and tools to work seamlessly with one another will be enumerated individually. Team and tool interaction patterns will also be outlined that reduce the friction that will arise while addressing application security risks. Using open source products such as OWASP ZAP, ThreadFix, Bugzilla and Eclipse, a significant amount of time will also be spent demonstrating the kinds of interactions that need to be enabled between tools. This will provide attendees with practical examples on how to replicate a powerful, integrated Application Security program within their own organizations. In addition, how to gather program-wide metrics and regularly calculate measurements such as mean-time-to-fix will also be demonstrated to enable attendees to monitor and ensure the continuing health and performance of their Application Security program.
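A mean-time-to-fix calculation of the kind mentioned can be sketched as follows. The records are invented and the fields are illustrative, not the actual ThreadFix schema:

```python
from datetime import date

# Hypothetical vulnerability records aggregated from scanners via a hub.
vulns = [
    {"opened": date(2014, 1, 3), "closed": date(2014, 1, 10)},
    {"opened": date(2014, 1, 5), "closed": date(2014, 2, 4)},
    {"opened": date(2014, 2, 1), "closed": None},  # still open
]

def mean_time_to_fix(records):
    """Mean days from discovery to closure, over resolved findings only."""
    spans = [(r["closed"] - r["opened"]).days for r in records if r["closed"]]
    return sum(spans) / len(spans) if spans else None

print(mean_time_to_fix(vulns))  # 18.5
```

Computing this regularly, as the abstract suggests, gives a program-wide health signal: a rising mean-time-to-fix flags friction between the teams and tools long before it shows up as unremediated risk.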
Leverage DevOps & Agile Development to Transform Your Application Testing Pro... (Deborah Schalm)
Discover how Sona Srinivasan, Senior Architect of Cisco IT’s Global Architecture and Technology Services group, helps transform an IT DevOps strategy to a Security DevOps strategy, with IBM Security's assistance. Cisco is presently implementing continuous security and agile methods throughout the software development lifecycle (SDLC), and specific examples of current initiatives will be reviewed in this session.
Today, organizations of all shapes and sizes depend on feature-packed application releases to keep end users productive and happy. In their new book, The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations, Gene Kim and his co-authors shared ways that high-performing organizations use DevOps principles to enable reliable deployments - and boring releases!
Gene Kim, CTO, DevOps researcher and co-author of the DevOps Handbook and The Phoenix Project, and Anders Wallgren, CTO of Electric Cloud shared their tips for overcoming the challenges of DevOps and Continuous Delivery at scale. During the webinar, they discussed:
- The business value of DevOps
- How to eliminate “deployment anxiety” and increase business agility
- Lessons learned from large scale DevOps transformations
- The advantages and disadvantages of practicing DevOps in large organizations
I created and delivered this presentation at the 2007 PLM World Conference. The topic was a client Teamcenter Community implementation project I managed.
Continuous Delivery Pipelines help developers safely iterate on and test new code in production-like environments. But what if your team is tasked with developing the CD Pipeline itself? How do you empower multiple SCM engineers to make changes in a controlled way without impacting each other or degrading developer productivity?
Learn how Sony has leveraged ElectricFlow and DevOps principles to construct a flexible, version-able, review-able, test-able, revert-able and resilient CD pipeline. This talk will focus on the infrastructure, tools and processes used to create and operate this pipeline, and the methods used to safely and rapidly onboard new business groups, developers and SCM engineers.
Modelling and Analysing Operation Processes for Dependability (Liming Zhu)
The 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN13) talk slides. June 27th, 2013. Full text here: http://www.nicta.com.au/pub?doc=7031
Il webinar è una finestra su IBM Requirements Quality Assistant.
Abbiamo mostrato come RQA ci aiuta a migliorare la qualità dei requisiti e quindi delle applicazioni che rilasciamo.
Abbiamo visto inoltre come IBM RQA può supportarci quando utilizziamo tool come IBM Jazz.
Il webinar è condotto da Profesia e un esperto di prodotto. Richiedi una demo personalizzata scrivendo a sales@profesia.it
*** Watch the on demand webinar recording here - https://curiositysoftware.ie/resources/test-data-development-webinar/ ***
A Curiosity Software and Windocks webinar, presented live on the 2nd of February, 2021. Now available to stream on demand!
Test data “provisioning” is lagging far behind the sophistication of today’s systems. Development has shifted to containerisation and microservices, rapidly ripping out and replacing reusable components. Testers must also rapidly rip-and-replace versioned components in their environments, while retaining complex data relationships between shifting technologies. The deployed data must furthermore be diverse, compliant and compact, fulfilling all positive and negative scenarios in the shortest test runs possible.
Sound like an impossible requirement? While it is, if you rely on making costly physical copies of low-variety production data. “Test data management” instead needs to embrace the world of containers and APIs, along with the pipelines that enable developers to deliver so rapidly. We need a new approach to testing massively complex systems in short sprints.
This webinar will showcase how Test Data Automation combines with containerised data cloning, automatically deploying versioned virtual databases as tests are created and run. Huw Price, Managing Director of Curiosity Software Ireland, and Paul Stanton, co-founder and Vice President of Windocks, will show you how:
1. Test Data Automation provides complete and compliant data on demand, delivering test-ready data that is masked and enhanced with synthetic data.
2. Parallel test teams and frameworks leverage fresh containers, without slow data provisioning or complex configuration.
3. Organisations regain full visibility and control over test data, while enjoying the added affordability of database virtualisation.
*** Watch the on demand webinar recording here - https://curiositysoftware.ie/resources/test-data-development-webinar/ ***
Given the changing nature of enterprise networking, Riverbed decided to survey attendees of the recent VMworld conference about their companies’ current plans for these emerging technologies. Riverbed surveyed 260 attendees face-to-face, from a variety of roles and with a median company size of 2,300 employees.
Customers and employees complaining about poor network performance or application delays? Want to put an end to the whining? Learn how combining visibility with WAN optimization delivers optimal performance for customers and employees regardless of location by watching this webinar from Riverbed. http://rvbd.ly/1OVbaQw
Is your company thinking about using Selenium to implement test automation in a joint development and operations environment? If your company has already started using Selenium, have you experienced execution or integration challenges? The path to a well-oiled and successful Selenium test automation program comes down to using the right techniques and development standards that incorporate modularity and flexibility. Jin Reck describes how to design effective web test automation development, and shares common challenges and solutions when implementing an automated testing framework in the real world. Jin shows how to incorporate Selenium with continuous integration platforms and discusses techniques, adjustments, lessons learned, and best practices from successful implementations. Leave with a better understanding of how to design and employ Selenium to create robust and reliable automated tests that increase the efficiency and productivity of test teams and make for a capable and successful testing program.
Static Application Security Testing Strategies for Automation and Continuous ... (Kevin Fealey)
Static Application Security Testing (SAST) introduces challenges for existing software development lifecycle (SDLC) configurations. Applying strategies at different points of the SDLC can shorten deployment time while still improving the quality and security of the deliverable. This session will discuss the different strategies that can be implemented for SAST within the SDLC: strategies catering to developers versus security analysts versus release engineers. The strategies consider the challenges each team may encounter, allowing them to incorporate security testing without jeopardizing deadlines or existing processes.
Remediation Statistics: What Does Fixing Application Vulnerabilities Cost? (Denim Group)
For the security industry to mature, more data needs to be available about the true cost of security vulnerabilities. Data and statistics are starting to be released, but most of this currently focuses on the prevalence of different types of vulnerabilities and incidents rather than the costs of addressing the underlying issues. This session presents statistics from the remediation of 15 web-based applications in order to provide insight into the actual cost of remediating application-level vulnerabilities.
The presentation begins by setting out a structured model for software security remediation projects so that time spent on tasks can be consistently tracked. It lays out possible sources of bias in the underlying data to allow for better-informed consumption of the final analysis. It also discusses different approaches to remediating vulnerabilities, such as fixing easy vulnerabilities first versus fixing serious vulnerabilities first.
Next, historical data from the fifteen remediation projects is presented. This data consists of the average cost to remediate specific classes of vulnerabilities – cross-site scripting, SQL injection, and so on – as well as the overall project composition, showing the percentage of time spent on actual fixes versus the time spent on supporting activities such as environment setup, testing and verification, and deployment. The data on the remediation of specific vulnerabilities allows for a comparison of the relative difficulty of remediating different vulnerability types. The data on the overall project composition can be used to determine the relative “efficiency” of different projects.
Finally, analysis of the data is used to create a model for estimating remediation projects so that organizations can create realistic estimates in order to make informed remediate/do not remediate decisions. In addition, characteristics of the analyzed projects are mapped to project composition to demonstrate best practices that can be used to decrease the cost of future remediation efforts.
Taking AppSec to 11 - BSides Austin 2016 (Matt Tesauro)
Curious how DevOps, Agile, and CI/CD ideas can speed up your AppSec program? Here's how it can be done, with an example where it led to a 5x speed/flow improvement.
Building Your Application Security Data Hub - OWASP AppSecUSA (Denim Group)
One of the reasons application security is so challenging to address is that it spans multiple teams within an organization. Development teams build software, security testing teams find vulnerabilities, security operations staff manage applications in production, and IT audit organizations make sure that the resulting software meets compliance and governance requirements. In addition, each team has a different toolbox they use to meet their goals, including scanning tools, defect trackers, Integrated Development Environments (IDEs), WAFs, and GRC systems. Unfortunately, in most organizations the interactions between these teams are often strained and the flow of data between these disparate tools and systems is non-existent or tediously implemented manually.
In today’s presentation, we will demonstrate how leading organizations are breaking down these barriers between teams and better integrating their disparate tools to enable the flow of application security data between silos to accelerate and simplify their remediation efforts. At the same time, we will show how to collect the proper data to measure the performance and illustrate the improvement of the software security program. The challenges that need to be overcome to enable teams and tools to work seamlessly with one another will be enumerated individually. Team and tool interaction patterns will also be outlined that reduce the friction that will arise while addressing application security risks. Using open source products such as OWASP ZAP, ThreadFix, Bugzilla and Eclipse, a significant amount of time will also be spent demonstrating the kinds of interactions that need to be enabled between tools. This will provide attendees with practical examples on how to replicate a powerful, integrated Application Security program within their own organizations. In addition, how to gather program-wide metrics and regularly calculate measurements such as mean-time-to-fix will also be demonstrated to enable attendees to monitor and ensure the continuing health and performance of their Application Security program.
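A program-wide measurement such as mean-time-to-fix, mentioned above, reduces to a simple aggregation over finding timestamps. A minimal sketch follows; the data shape is a hypothetical illustration, not the ThreadFix or defect-tracker API:

```python
from datetime import datetime

def mean_time_to_fix(findings):
    """Average days from discovery to remediation across closed findings.

    `findings` is a list of (opened, closed) datetime pairs; findings that
    are still open (closed is None) are excluded from the average.
    """
    days = [(closed - opened).total_seconds() / 86400
            for opened, closed in findings if closed is not None]
    return sum(days) / len(days) if days else None
```

Calculated regularly (per sprint or per release), this kind of metric makes the health trend of the program visible rather than anecdotal.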
Leverage DevOps & Agile Development to Transform Your Application Testing Pro... (Deborah Schalm)
Discover how Sona Srinivasan, Senior Architect of Cisco IT’s Global Architecture and Technology Services group, helps transform an IT DevOps strategy into a Security DevOps strategy, with IBM Security's assistance. Cisco is presently implementing continuous security and agile methods throughout the software development lifecycle (SDLC), and specific examples of current initiatives will be reviewed in this session.
Today, organizations of all shapes and sizes depend on feature-packed application releases to keep end users productive and happy. In their new book, The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations, Gene Kim and his co-authors shared ways that high-performing organizations use DevOps principles to enable reliable deployments - and boring releases!
Gene Kim, CTO, DevOps researcher and co-author of the DevOps Handbook and The Phoenix Project, and Anders Wallgren, CTO of Electric Cloud shared their tips for overcoming the challenges of DevOps and Continuous Delivery at scale. During the webinar, they discussed:
- The business value of DevOps
- How to eliminate “deployment anxiety” and increase business agility
- Lessons learned from large scale DevOps transformations
- The advantages and disadvantages of practicing DevOps in large organizations
I created and delivered this presentation at the 2007 PLM World Conference. The topic was a client Teamcenter Community implementation project I managed.
Continuous Delivery Pipelines help developers safely iterate on and test new code in production-like environments. But what if your team is tasked with developing the CD Pipeline itself? How do you empower multiple SCM engineers to make changes in a controlled way without impacting each other or degrading developer productivity?
Learn how Sony has leveraged ElectricFlow and DevOps principles to construct a flexible, version-able, review-able, test-able, revert-able and resilient CD pipeline. This talk will focus on the infrastructure, tools and processes used to create and operate this pipeline, and the methods used to safely and rapidly onboard new business groups, developers and SCM engineers.
Modelling and Analysing Operation Processes for Dependability (Liming Zhu)
The 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN13) talk slides. June 27th, 2013. Full text here: http://www.nicta.com.au/pub?doc=7031
Using Collaborate Plan to set up your Blackboard Collaborate Session in advance of the arrival of your participants is a great way to focus on your audience and not the tools.
Bridging the Engagement Gap for Distance Students Through Telerobotics (Michael Griffith)
Traditional telepresence classrooms are expensive to implement and often require technical and instructional support to successfully translate an instructor's pedagogy. There is often an engagement gap when comparing the level of participation and curiosity exhibited by the distance student and their resident peers. A second issue is that students enrolled in a fully supported telepresent class are often enrolled in other courses that don’t take place in rooms equipped for their needs. Our team is exploring a mix of technologies that will allow distance students to engage with their instructor, peers, and course material in a standard classroom using an inexpensive, portable telerobotics platform. We hope to close the engagement gap by placing these robots among the resident students and in the direct line of sight of the instructor, thus reducing distraction for the resident students and allowing the instructor to react to non-verbal cues exhibited by the distance students.
Challenges in Practicing High Frequency Releases in Cloud Environments (Liming Zhu)
Talk at RELENG 2014
Full paper: http://www.nicta.com.au/pub?doc=7925
The continuous delivery trend is dramatically shortening release cycles from months into hours. Applications with high frequency releases often rely heavily on automated deployment tools using cloud infrastructure APIs. We report some results from experiments on reliability issues of cloud infrastructure and trade-offs between using heavily-baked and lightly-baked images. Our experiments were based on Amazon Web Service (AWS) OpsWorks APIs and configuration management tool Chef. As a result of our experiments, we then propose error handling practices that can be included in tailor-made continuous deployment facilities.
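The kind of error-handling practice the experiments motivate can be sketched as a retry-with-exponential-backoff wrapper around an unreliable cloud API call. This is a generic illustration under assumed names, not the AWS OpsWorks or Chef interface itself:

```python
import time

class TransientCloudError(Exception):
    """Stand-in for a throttling or timeout error from a cloud API."""

def call_with_retries(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a deployment step with exponential backoff on transient errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientCloudError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # wait 1s, 2s, 4s, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

Distinguishing transient errors (worth retrying) from permanent ones (fail fast) is exactly the kind of policy a tailor-made continuous deployment facility has to encode.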
More related info at our DevOps book http://www.ssrg.nicta.com.au/projects/devops_book/
Dependable Operation - Performance Management and Capacity Planning Under Con... (Liming Zhu)
Talk at a CMGA (http://www.cmga.org.au/) meetup
Modern large-scale applications experience sporadic changes due to operational activities such as upgrade, redeployment, on-demand scaling and interferences from other simultaneous operations. This poses new challenges in system monitoring, capacity planning, performance management, error detection and diagnosis. For example, the traditional anomaly-detection-based techniques are less effective during the “sporadic” operation period as a wide range of legitimate changes confound the situation and make performance baseline establishment for “normal” operation difficult. The increasing frequency of these sporadic operations (e.g. due to continuous deployment) is exacerbating the problem. In this talk, we will introduce a number of ongoing research activities at NICTA addressing these issues. For example, we propose the Process Oriented Dependability (POD) approach, an approach that explicitly models these sporadic operations as processes and uses the process context to filter logs, traverse fault trees and conduct adaptive monitoring.
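To see why a fixed baseline struggles during sporadic operations, consider a minimal z-score detector. This is an illustration of the baseline problem, not the POD approach itself:

```python
import statistics

def zscore_anomalies(baseline, window, threshold=3.0):
    """Flag samples deviating from the baseline mean by more than
    `threshold` standard deviations."""
    mean = statistics.mean(baseline)
    sd = statistics.stdev(baseline)
    return [x for x in window if abs(x - mean) / sd > threshold]
```

Against a ~100 ms latency baseline, a jump to 250 ms is flagged whether it is a fault or an expected effect of an ongoing redeployment; the detector has no way to tell them apart. That is why POD adds the operational process context before judging logs and metrics.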
Real World Problem Solving Using Application Performance Management 10 (CA Technologies)
CA Application Performance Management 10 dramatically reduces the time needed to find and solve app problems. In this session you will learn about common problem-solving techniques used by experts to solve real-world app problems. You will get a chance to put these techniques to the test in a hands-on lab that mimics an interesting application performance problem.
For more information, please visit http://cainc.to/Nv2VOe
VMworld 2013: Moving Enterprise Application Dev/Test to VMware’s Internal Pri... (VMworld)
VMworld Europe 2013
Thirumalesh Reddy, VMware
Venkat Gopalakrishnan, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
FOSDEM 2024 - Deploy Fast, Without Breaking Things: Level Up APIOps With Open... (SonjaChevre)
For developers building distributed applications, few things are more frustrating than having an API break unexpectedly. When an API changes in a backwards-incompatible way, it can disrupt downstream consumers. Suddenly, applications start failing, integrations break, and developers are left scrambling to fix the issues.
How can developers deploy changes fast without breaking APIs and maintain stability for consumers? By adding modern observability techniques to their APIOps pipeline.
In this talk, we will start together from a traditional APIOps pipeline in ArgoCD and explore how incorporating modern observability techniques can help developers deploy changes quickly and efficiently while maintaining stability for API consumers.
In this presentation, Sonja (Group Product Manager in the API space) and Adnan (5 years in the observability space) are combining their expertise to present best practices for detecting and resolving API issues in production.
Nonfunctional Testing: Examine the Other Side of the Coin (TechWell)
Creating a highly available, scalable, and high-performing system requires a substantial amount of what we call nonfunctional testing. Developing nonfunctional testing skills is a must for many of today’s quality engineers (QEs). For the past several years, Balaji Arunachalam’s quality team for Intuit Core Services has faced several high-availability and disaster-recovery buildup and testing challenges. Their journey includes the evolution of functional QEs into hybrid QEs who are capable of doing both functional and nonfunctional testing. Nonfunctional testing includes capacity, stability, benchmarking, FMEA/RAS, datacenter failover, and scalability testing. Balaji shares nonfunctional testing best practices, learnings, and mistakes they encountered on this journey. If you or your team is ready to flip the coin and take a serious look at nonfunctional testing methods, opportunities, challenges, and solutions, this session is for you.
Case Study: Verizon Wireless: Chasing the Yellow Before They Turn Red (CA Technologies)
In the age of the application economy, Verizon shows one way they strive to earn user loyalty. By correlating network and application metrics and gaining network and application performance insights, Verizon is able to homogenize behaviors across organizations and focus on fault prevention rather than decreasing time to fix after the fault has already occurred.
For more information on DevOps solutions from CA Technologies, please visit: http://bit.ly/1wbjjqX
CPN208 Failures at Scale & How to Ride Through Them - AWS re:Invent 2012 (Amazon Web Services)
At scale, rare and unexpected events will happen. Things eventually will go wrong. This talk dives into what can go wrong at scale and how to architect applications to ride through disaster obliviously. We’ll talk about AWS infrastructure design including Regions and Availability Zones and show how applications can be written and operated to best exploit this industry-unique infrastructure redundancy model. Believing that experience is one of the best teachers, we will go through some of the more interesting and educational industry post mortems including some experienced at AWS to motivate these application design decisions and show how they can mitigate the damage of the truly unexpected.
AppSphere 15 - How AppDynamics is Shaking up the Synthetic Monitoring Product... (AppDynamics)
Synthetic monitoring has been around for nearly two decades, but innovation in this area has slowed to a trickle. Users are coping with complex and disjointed products driven by proprietary technology. This is about to change: AppDynamics Synthetic monitoring is driven by WebPageTest, the leading-edge open source front-end optimization technology, and W3C standards like WebDriver. AppDynamics has embraced these and combined them with advances in cloud computing to deliver a new generation of synthetic monitoring. These technologies allow not only for availability monitoring today, but also hold a vast array of use cases and capabilities for the future that will drive new innovation.
Key Takeaways:
- Learn about WebPageTest, and why it's the leading tool for front-end optimization
- How AppDynamics leverages WebPageTest and WebDriver technologies
- How AppDynamics is leveraging advances in cloud computing to deliver a new generation of synthetic monitoring
- What future capabilities AppDynamics will leverage from these projects to create new use cases
This deck was originally presented at AppSphere 2015.
Oredev: Performance Testing in New Contexts (Eric Proegler)
Virtualization, Cloud Deployments, and Cloud-Based Tools have challenged and changed performance testing practices. Today’s performance tester can summon tens of thousands of virtual users from the cloud in a few minutes, at a cost far lower than the expensive on-premise installations of yesteryear.
Meanwhile, systems under test have changed more. Updated software stacks have increased the complexity of scripting and performance measurement, but the biggest changes are in the nature and quantities of resources powering the systems. Interpreting resource usage when resources are shared on a private virtualization platform is exceedingly difficult. Understanding resources when they live in a large public cloud is impossible.
Metrics Driven DevOps - Automate Scalability and Performance Into your Pipeline (Andreas Grabner)
Continuous Delivery only works if you combine automation with automatic, metrics-driven quality gates focusing on architectural, scalability, and performance metrics.
In this presentation I start with several dashboard examples explaining key metrics in production, then show how to automate these metrics into your delivery pipeline.
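A metrics-driven quality gate of the kind described boils down to a threshold comparison in the pipeline. A minimal sketch with hypothetical metric names (not a specific monitoring tool's API):

```python
def quality_gate(metrics, thresholds):
    """Return the list of threshold violations for a build.

    `metrics` and `thresholds` both map metric name -> value; a metric
    missing from the measurements counts as a violation so that a broken
    collector cannot silently pass the gate.
    """
    return [name for name, limit in thresholds.items()
            if metrics.get(name, float("inf")) > limit]

def gate_passes(metrics, thresholds):
    """The build is promoted only when no metric exceeds its limit."""
    return not quality_gate(metrics, thresholds)
```

In a pipeline step, `gate_passes` would decide whether the build is promoted to the next stage; thresholds for response time, database round-trips, or bytes transferred catch architectural regressions long before a load test would.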
Moderator:
Chris Grundemann, Network Automation Forum
Speakers:
Jeff Loughridge, Konekti Systems
Mark Ciecior, Carrier Access IT
William Collins, Alkira
Ask The Architect: RightScale & AWS Dive Deep into Hybrid IT (RightScale)
With the increased use of cloud services, organizations are faced with finding the most efficient way to use existing IT infrastructure alongside cloud-based compute, storage and networking resources. This has resulted in the rise of hybrid IT whereby companies leverage both on-premises and cloud resources to drive increased agility, stability and accessibility.
Examine common application performance problems hiding in plain sight. See how you can quickly remove the noise, pinpoint root cause and fix these problems once and for all. Watch the webinar replay: http://rvbd.ly/1QGxMBs
Software Architecture for Foundation Model-Based Systems (Liming Zhu)
With the successful implementation of Large Language Models (LLMs) in chatbots like ChatGPT, there is growing attention on foundation models, which are anticipated to serve as core components in the development of future AI systems. Yet, systematic exploration into the design of foundation model-based systems, particularly concerning risk management, trust, and trustworthiness, remains limited. In this talk, I outline the challenges and initial approaches in architecting LLM-based systems and examine how LLM systems impact software engineering. I point to some initial directions, such as architecting as a process of understanding (rather than designing/building), setting and trading off guardrails (rather than quality attributes), and radical observability.
Responsible/Trustworthy AI in the Era of Foundation Models (Liming Zhu)
The emergence of large language models (LLMs) such as GPT-4 has garnered significant attention, placing foundation models at the forefront of AI systems. However, integrating foundation models raises concerns regarding responsible/trustworthy AI due to their opaque nature and rapidly moving capability boundaries. This talk addresses these challenges in the context of industry and defence and proposes a pattern-oriented reference architecture for responsible/trustworthy AI design in foundation model-based systems. It explores the evolution of AI systems architecture, transitioning from a many-model/module architecture to an increasingly monolithic architecture centered around foundation models.
ICSE23 Keynote: Software Engineering as the Linchpin of Responsible AI (Liming Zhu)
From humanity’s existential risks to safety risks in critical systems to ethical risks, responsible AI, as the saviour, has become a massive research challenge with significant real-world consequences. However, achieving responsible AI remains elusive despite the plethora of high-level ethical principles, risk frameworks and progress in algorithmic assurance. In the meantime, software engineering (SE) is being upended by AI, grappling with building system-level quality and alignment from inscrutable ML models and code generated from natural language prompts. The upending poses new challenges and opportunities for engineering AI systems responsibly. This talk will share our experiences in helping the industry achieve responsible AI systems by inventing new SE approaches. It will dive into industry challenges (such as risk silos and principle-algorithm gaps) and research challenges (such as lack of requirements, emerging properties and inscrutable systems) and make the point that SE is the linchpin of responsible AI. But SE also requires some fundamental rethinking - shifting from building functions to AI systems to discovering and managing emerging functions from AI systems. Only by doing so can SE take on critical new roles, from understanding human intelligence to building a thriving human-AI symbiosis.
Responsible AI & Cybersecurity: A tale of two technology risks (Liming Zhu)
With the broader adoption of digital technologies and AI, organisations face the emerging risks of AI, the unfamiliar, and the intensified risk of cybersecurity, the familiar. AI and cybersecurity are intertwined, but risk silos are often created when they are dealt with at the technology and governance levels. This talk will explore the interactions between responsible AI and cybersecurity risks via industry case studies. It will show how we can break down the risk silos and use emerging trust-enhancing technologies, architecture and end-to-end software engineering/DevOps practices to connect the two worlds and uplift the risk management posture for both.
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed (Malak Abu Hammad)
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
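The query side of the implementation guide above reduces to an aggregation stage over a pre-built vector index. A hedged sketch of an Atlas-style `$vectorSearch` pipeline built as plain Python dictionaries; field names such as `embedding` and the index name are assumptions for illustration:

```python
def vector_search_pipeline(query_vector, index="vector_index",
                           path="embedding", k=5):
    """Build an aggregation pipeline returning the k nearest documents."""
    return [
        {"$vectorSearch": {
            "index": index,              # name of the vector index
            "path": path,                # document field holding embeddings
            "queryVector": query_vector,
            "numCandidates": k * 20,     # oversample candidates for recall
            "limit": k,
        }},
        {"$project": {"_id": 0, "text": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]
```

The pipeline would be passed to the collection's `aggregate(...)` call; the `numCandidates` oversampling factor is the usual knob trading query latency for recall.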
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Building RAG with self-deployed Milvus vector database and Snowpark Container... (Zilliz)
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a Docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
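The core RAG loop such a talk covers (embed the question, retrieve the nearest stored chunks, stuff them into a prompt) can be illustrated with an in-memory store standing in for the Milvus collection. A conceptual sketch, not the pymilvus API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, store, k=2):
    """Return the texts of the k entries most similar to the query.

    `store` is a list of (text, embedding) pairs, standing in for a
    Milvus collection searched via an approximate-nearest-neighbour index.
    """
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, contexts):
    """Assemble the retrieved chunks into a grounded prompt for the LLM."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"
```

In a real deployment the brute-force `sorted` scan is what the vector database replaces with an index, which is the whole point of running Milvus rather than searching in application code.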
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We closed with a lovely workshop in which the participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
UiPath Test Automation using UiPath Test Suite series, part 5 (DianaGray10)
Welcome to part 5 of the UiPath Test Automation using UiPath Test Suite series. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
20 Comprehensive Checklist of Designing and Developing a Website (Pixlogix Infotech)
Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Enhancing adoption of Open Source Libraries: A case study on Albumentations.AI (Vladimir Iglovikov, Ph.D.)
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster and former Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
How to Get CNIC Information System with Paksim Ga (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Full-RAG: A modern architecture for hyper-personalization (Zilliz)
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
National Security Agency - NSA Mobile Device Best Practices
Threats to mobile devices are increasingly prevalent and growing in scope and complexity. Users of mobile devices want to take full advantage of their devices' features, but many of those features trade security for convenience and capability. This best practices guide outlines steps users can take to better protect their personal devices and information.
Cloud API Issues: an Empirical Study and Impact
1. NICTA Copyright 2012 From imagination to impact
Cloud API Issues: an Empirical Study and Impact
Qinghua Lu, Liming Zhu, Len Bass, Xiwei Xu,
Zhanwen Li, Hiroshi Wada
Software Systems Research Group, NICTA
QoSA13, Vancouver
Slides at: http://www.slideshare.net/LimingZhu/
Motivation
• Cloud applications fail due to operations issues
– Gartner reports that 80% of outages are caused by operations
• People/Process: replication/failover, auto-scaling, upgrade…
– Lessons from our own cloud DR product: Yuruware.com
– DevOps movement
• Operational causes of failures
– Infrastructure and processes, but,
– Most things are done through infrastructure API
• Highly dependable cloud applications require
– Architecting for not just the software but also its operation (thru API)
– Architecting for indirect control (thru API)
– Better understanding of Cloud API Issues
• reliability, performance, nature of failures and faults
Main Contributions
• Empirical study of cloud infrastructure API issues
– 922 failure/fault cases from Amazon EC2 forums (2010 to 2012)
• Centered on the five most-used API calls
• Fault analysis supplemented by other sources
– Classified the API failures and faults (causes of failures)
• Using the classic dependable computing taxonomy (Avizienis et al., 2004)
• Failures: content, late timing, halt, erratic
• Faults: development, physical, interaction
• Impact analysis through an initial proposal for tolerating
cloud API failures/faults
– Suggestions for tolerating content failures
– 11 patterns for tolerating timing failures
Some Empirical Findings
• The majority (60%) of API failure cases involve stuck or unresponsive API calls
• 19% of the cases are related to the output issues of API calls
– Error messages, missing/wrong/unexpected contents
• 12% of the cases involve slow-responding API calls
• 9% of the cases involve API calls that
– were pending for a certain time and then returned to the original state without properly informing the caller
– were reported as successful at first but failed later
Methodology
• Amazon: EC2 forums and outage reports
• Netflix: technical blogs and GitHub OSS projects
• Yuruware.com: disaster recovery product which heavily relies on cloud
infrastructure APIs
Data Collected from Amazon EC2 Forum
Searched keywords and number of returned records:

API of interest         Records (inception to 2012)   Records (2010 to 2012)
describe instance       283                           150
start instance          227                           204
stop instance           349                           348
detach volume           235                           203
associate elastic IP    264                           204
Total                   1358                          1109

Case type, number and percentage in the found cases:

Case type          Number of cases   Percentage (2010 to 2012)
API failures       922               83%
Enquiries          125               11%
API enhancements   62                6%
Classification of API Failures
Fault -> Error -> Failure
Failure: deviation from correct
service (external visible)
Error: internal erroneous state
Fault: adjudicated or hypothesized
causes of a failure
[13] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, "Basic concepts and taxonomy of dependable and secure computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, pp. 11-33, 2004.
Classification of API Failures
• Content failures (19%)
– With error messages; missing/wrong/unexpected content
• In 61% of cases, users understood the causes/solutions from the error message
• In 39% of cases, users could not pinpoint the causes from the error message
An example of a failed call where the error message is unclear:
Posted on Jan 10, 2012 5:42 AM
Symptom: When a user tried to start an instance, the operation failed with an unclear error message.
Error message: State Transition Reason - Server.InternalError: Internal error on launch
Root cause: Unknown.
Solution: AWS engineers advised detaching the EBS volume from the instance and attaching it to another running instance.

An example of a failed call where the error message is clear:
Posted on Jun 14, 2012 9:57 PM
Symptom: Failed API calls and a "Request limit exceeded" error message.
Error message: Client.RequestLimitExceeded: Request limit exceeded
Root cause: API calls exceeded the limit.
Solution: N/A. There is no official information on the limit, the time span over which it is calculated, or a suggested wait time.
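Since AWS publishes no official limit or wait time for the RequestLimitExceeded case, a consumer can only retry defensively. Below is a minimal sketch of retry with exponential backoff and jitter; the exception class and the wrapped call are placeholders, not a real AWS SDK API:

```python
import random
import time

class RequestLimitExceeded(Exception):
    """Placeholder for the provider's throttling error."""

def call_with_backoff(api_call, max_attempts=5, base_delay=0.5):
    """Retry a throttled API call with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return api_call()
        except RequestLimitExceeded:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the failure
            # Double the wait each attempt; jitter de-synchronizes retriers.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Jitter matters here because many clients hitting the same undocumented limit would otherwise retry in lockstep and exceed it again.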
Classification of API Failures
• Late timing failures (12%)
– the arrival time of the delivered information deviates from the expected time, but the information does eventually arrive
A late timing failure example.
Posted on Aug 27, 2012 11:57 AM
Symptom: It took 16 minutes for an instance
to stop.
Root cause: n/a.
Solution: The AWS engineer advised to try
“force stop” twice if this happens next time.
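Late timing failures like this can be bounded on the consumer side by giving each call a deadline and force-failing once it passes. A minimal sketch using a worker thread (an illustration under that assumption, not code from the study; note the stuck call itself is not interrupted, the caller just stops waiting):

```python
import concurrent.futures

def call_with_deadline(api_call, deadline_s):
    """Run an API call in a worker thread and stop waiting (force-fail)
    once the deadline passes. The underlying call is not interrupted."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(api_call)
    try:
        # Raises concurrent.futures.TimeoutError if the call is late.
        return future.result(timeout=deadline_s)
    finally:
        # Do not block on a stuck worker thread.
        pool.shutdown(wait=False)
```

Picking the deadline from observed response-time percentiles, as Hystrix does, keeps the force-fail from firing on normal variance.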
Classification of API Failures
• Halt failures (60%)
– The external state becomes constant.
– Most frequent failures!
A general halt failure example.
Posted on Jun 27, 2012 12:04 AM
Symptom: A user reported that an instance was stuck in the stopping state and "force stop" did not help.
Root cause: n/a.
Solution: The AWS engineer stopped the instance for the user on the AWS side (with some side
effect).
A silent failure example.
Posted on Oct 23, 2012 7:45 AM
Symptom: An instance was not accessible and the user could not stop/start it or create a snapshot.
Root cause: AWS outage.
Solution: The AWS engineer advised that the user must launch a replacement instance from a pre-existing backup (EBS AMI). Attempts to stop an inaccessible instance will likely result in the instance becoming stuck in the stopping state. Customers that do not have a known good backup must wait for the issue to be resolved for their instance connectivity to be restored.
Classification of API Failures
• Erratic failures (35%)
– The delivered service is unpredictable. Two subtypes:
• the call is pending for a certain time and then returns to the original state
• the call appears to succeed at first but eventually fails
Two erratic failure examples.
Posted on Feb 1, 2012 8:15 AM
Symptom: A user associated an elastic IP with an instance and could SSH into the instance with the
elastic IP. After a few minutes, the elastic IP was silently disassociated from the instance.
Root cause: An issue with the underlying host.
Solution: The AWS engineer advised that the quickest fix was to stop and then start the instance to
relocate to a different host.
Posted on Jan 14, 2011 1:43 PM
Symptom: A user tried to start the instance several times. The status showed pending and then went back to stopped.
Root cause: n/a.
Solution: The AWS engineer returned the user’s EBS volume to the available state and believed this
would resolve the user’s problem.
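Because erratic failures report success and then silently revert (as with the disassociated elastic IP above), the immediate return value cannot be trusted. One defensive sketch, a generic illustration rather than a pattern named in the paper, polls the resource after the call and requires the expected state to hold for a settling period:

```python
import time

def confirm_state(get_state, expected, settle_s=2.0, poll_s=0.5):
    """After an API call, poll the resource and require the expected state
    to hold for settle_s seconds, guarding against silent reversion."""
    stable_since = None
    deadline = time.monotonic() + 10 * settle_s  # overall budget (an assumed default)
    while time.monotonic() < deadline:
        if get_state() == expected:
            if stable_since is None:
                stable_since = time.monotonic()
            elif time.monotonic() - stable_since >= settle_s:
                return True  # state held long enough to be trusted
        else:
            stable_since = None  # state reverted; restart the stability window
        time.sleep(poll_s)
    return False
```

Here `get_state` stands in for a describe-style API call; the settle and poll intervals would be tuned to the resource's normal transition times.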
Classification of Faults (Causes of Failures)
• Development faults – software bugs
– User workarounds exist but may break after bug fixing
• Physical faults
– Stopping/starting moves an instance to a new physical machine, but the stop itself can be problematic
– Future work: classifying using virtual resource characteristics
• Interaction faults
– Misconfiguration faults account for 30%
• Accidental & purposeful misconfiguration
– Purposeful misconfiguration
• lack of knowledge (subjective uncertainty vs. stochastic uncertainty)
• Configuration and operation impact on availability 1,2
1. X. Xu, Q. Lu, L. Zhu, et al., "Availability Analysis of In-Cloud Applications," in ISARCS 2013 (11:30 tomorrow)
2. Q. Lu, X. Xu, L. Zhu, L. Bass, et al., "Incorporating Uncertainty into In-Cloud Application Deployment Decisions for Availability," in IEEE Cloud 2013
Tolerating API Failures/Faults
• Perspective
– cloud consumer and application oriented
– limited visibility: e.g. may not know the root cause
– indirect control: e.g. solutions are thru APIs as well
• Different failures/faults require different approaches
– Failure/Fault classification dependent
– Suggestions, patterns and ad-hoc use of failure/fault characteristics:
• Content failure: alternative sources for content, defensive programming…
• Late timing failures: API call life cycle driven
API Call Life Cycle Driven Patterns
Pattern Examples
• Faster forced fail/complete
– force-fail-r or force-fail-s
• Netflix Hystrix: fail fast based on the 95th-99th percentile delay
– force-complete-r
• Yuruware: ignore some “describe” API calls
• Hedged requests or more sophisticated retry
– continue-request
• Common approach: send the same request to two places and cancel the slower one
– reallocate or reallocate-s
• Yuruware: attach the to-be-moved volume to different mover instances
after early mover failures
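The continue-request / hedged-request idea can be sketched as two concurrent attempts where the first response wins (a generic illustration; the attempt functions are placeholders for the same request sent to different endpoints):

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def hedged_request(attempts):
    """Issue the same request via several attempt functions and return the
    first result; slower attempts are abandoned, not interrupted."""
    pool = ThreadPoolExecutor(max_workers=len(attempts))
    futures = [pool.submit(fn) for fn in attempts]
    try:
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # best-effort; only not-yet-started attempts can cancel
        # If the first finisher raised, its exception propagates to the caller.
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)
```

The trade-off is extra load on the backend for every hedged call, which is why more sophisticated variants only hedge after the first attempt looks slow.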
Conclusion and Future Work
• Empirical study of cloud infrastructure API issues
– Analysed and classified 922 failures/faults from Amazon EC2 forums
• Inform better architecting for operations (i.e. operator as a stakeholder)
– Future work (completed)
• Expanded to more cases from other sources (2087 issues)
• Proposed a new scheme for classifying faults
• Tolerating cloud API failures/faults
– Patterns for tolerating different types of API failures/faults
– Future work (ongoing)
• More actionable mechanisms/patterns and their implementation
• Use the characteristics of the faults and failures
– for smarter recovery and error diagnosis during operation
• What we need: more real world operation logs and collaborators
{Liming.Zhu, Len.Bass}@nicta.com.au
Slides available at http://www.slideshare.net/LimingZhu/