Enhancing the Analysis of Software Failures in Cloud Computing Systems with Deep Learning

Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella
DIETI, Università degli Studi di Napoli Federico II, Italy
{cotroneo, luigi.desimone, pietro.liguori, roberto.natella}@unina.it
The 32nd International Symposium on Software Reliability Engineering (ISSRE), October 25-28, 2021

Cloud Computing Infrastructure
▪ Analyzing how faults can turn into service failures (Failure Mode Analysis) is very difficult and time-consuming, even for expert developers
• Huge volumes of data (hundreds of MBs, thousands of events)
• Large number of fault experiments
• High complexity, non-determinism
[Figure: an IaaS cloud serving client service requests and managed by sys. admins. Faults (storage, network, software, etc.) lead to failures (data loss, resource unavailable, etc.), which are collected as failure data.]

Our case study: OpenStack
[Figure: OpenStack services (Nova, Horizon, Cinder, Neutron, Glance, Keystone, Swift) and the workflow of an instance creation request: auth-token validation, get image id, get IP address, volume attachment.]
Silent failures occur as omissions, delays, or out-of-order events in these workflows

Events in Fault-Injection Experiments

Contribution
▪ A novel approach for discovering the classes of failure ("failure modes") of cloud computing systems, using fault injection and deep learning
▪ Case study on a dataset of thousands of failures of the OpenStack cloud computing platform
▪ The raw failure data (logs, event traces) are clustered into a few failure modes, which are easier for developers and sysadmins to interpret

Contribution (cont.)
▪ The failure dataset containing the events collected in OpenStack during our fault-injection experiments is publicly available on GitHub:
https://github.com/dessertlab/Failure-Dataset-OpenStack
▪ The paper is available on ScienceDirect:
https://doi.org/10.1016/j.jss.2021.111043

Failure Mode Analysis Based on Plain Sequences of Events
[Figure: analysis pipeline. Instrumentation of the communication libraries (REST APIs, Message Queues, …) on each node; execution with fault injection; traces under fault-injected conditions; vector representation; clustering; visualization of the clusters of failure modes (FAIL #1, FAIL #2, FAIL #3).]
Trace: AACABBA
Occurrence vector: <A = 4, B = 2, C = 1>
Example: the events A, B, C happened 4, 2 and 1 times, respectively, during the failure
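The occurrence-vector example above can be reproduced with a few lines of Python. This is a minimal sketch, assuming a trace is a string with one character per event; the function name `occurrence_vector` and the explicit list of event types are illustrative choices, not names taken from the authors' tooling.

```python
from collections import Counter


def occurrence_vector(trace, event_types):
    """Count how many times each event type appears in a trace.

    `trace` is a sequence of event symbols (assumed encoding: one
    character per event); `event_types` fixes the order of the
    vector's components.
    """
    counts = Counter(trace)
    # Counter returns 0 for event types that never occur in the trace.
    return [counts[event] for event in event_types]


vector = occurrence_vector("AACABBA", ["A", "B", "C"])
print(vector)  # [4, 2, 1]
```

Fixing the order of `event_types` across all experiments keeps the vectors comparable, which is what a clustering step needs.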

Failure Mode Analysis Based on Anomaly Detection
[Figure: analysis pipeline. Instrumentation of the communication libraries (REST APIs, Message Queues, …) on each node; fault-free execution and execution with fault injection; model training of normal behavior from the traces under fault-free conditions; anomaly detection on the traces under fault-injected conditions; clustering; visualization of the clusters of failure modes (FAIL #1, FAIL #2, FAIL #3).]
Traces under fault-free conditions: AABBBBCA, AABBBABCC, AABBABBC
Trace under fault-injected conditions: AACABBA
Anomaly vector: < A = 1, B = 0, C = 1 (spurious anomalies), A = 0, B = 2, C = 1 (missing anomalies) >
Cotroneo, Domenico, et al. "Enhancing failure propagation analysis in cloud computing systems." 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2019.
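To make the anomaly vector concrete, the sketch below derives spurious and missing event counts by aligning the fault-injected trace AACABBA with one fault-free trace (AABBBBCA) through a longest common subsequence (LCS). This alignment is an assumption made purely for illustration: the slides describe training a model of normal behavior, and the cited paper's actual detector is not reproduced here. The function names are hypothetical.

```python
from collections import Counter


def lcs_matched_counts(a, b):
    """Return per-event counts of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table forward to recover which events were matched.
    matched = Counter()
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            matched[a[i]] += 1
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched


def anomaly_vector(faulty, reference, event_types):
    """Spurious events appear only in the faulty trace; missing ones only in the reference."""
    matched = lcs_matched_counts(faulty, reference)
    faulty_counts, reference_counts = Counter(faulty), Counter(reference)
    spurious = [faulty_counts[e] - matched[e] for e in event_types]
    missing = [reference_counts[e] - matched[e] for e in event_types]
    return spurious, missing


spurious, missing = anomaly_vector("AACABBA", "AABBBBCA", ["A", "B", "C"])
print(spurious, missing)  # [1, 0, 1] [0, 2, 1]
```

With this single reference trace the result coincides with the anomaly vector shown on the slide; a learned model of normal behavior would generalize over all fault-free traces rather than rely on one.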

Proposed Solution:
Deep Embedded Clustering (DEC)
[Figure: analysis pipeline. Instrumentation of the communication libraries (REST APIs, Message Queues, …) on each node; execution with fault injection; traces under fault-injected conditions; vector representation; autoencoder (encoder, decoder, reconstruction error); clustering layer on the encoder's embedded features; visualization of the clusters of failure modes (FAIL #1, FAIL #2, FAIL #3).]
This solution can also be used in combination with anomaly detection, by applying it to anomaly vectors
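The clustering layer of DEC can be illustrated with its two core formulas, as given in the original DEC formulation by Xie et al.: a Student's t soft assignment of embedded points to centroids, and a sharpened target distribution used for training. The slides do not state these formulas, so treating them as the variant used here is an assumption; the function names and the toy data are hypothetical, and the sketch is kept in pure Python without the autoencoder itself.

```python
def soft_assign(embedded, centroids, alpha=1.0):
    """Soft assignment q[i][j] of embedded point i to centroid j (Student's t kernel)."""
    q = []
    for point in embedded:
        row = []
        for mu in centroids:
            dist_sq = sum((a - b) ** 2 for a, b in zip(point, mu))
            row.append((1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([value / total for value in row])  # each row sums to 1
    return q


def target_distribution(q):
    """Target p[i][j]: square q and normalize by the soft cluster frequency."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


# Toy embedded features (assumed values): two points near one centroid, one near the other.
embedded = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
q = soft_assign(embedded, centroids)
p = target_distribution(q)
assignments = [row.index(max(row)) for row in q]
print(assignments)  # [0, 0, 1]
```

In DEC, training minimizes the KL divergence between `p` and `q`, which pulls each embedded point toward the centroid it is already most confident about.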

Experiments
 2,538 fault-injection experiments in OpenStack cloud
computing:
• 4 fault-types
• 3 workloads (DEPL, NET, STO)
Failure Mode DEPL NET STO
Instance Failure 224 56 320
Volume Failure 151 - 38
Network Failure 52 30 -
SSH Failure 41 176 -
Cleanup Failure 69 - 157
No Failure 539 299 386
Ground Truth
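As a quick consistency check, the ground-truth counts can be summed per workload and compared with the stated total of 2,538 experiments. This is a small sketch; it assumes that a "-" entry in the table means no experiments fell into that failure mode for that workload, and the dictionary layout is an illustrative choice.

```python
# Ground-truth counts copied from the table; "-" entries are omitted (assumed to mean zero).
ground_truth = {
    "Instance Failure": {"DEPL": 224, "NET": 56, "STO": 320},
    "Volume Failure": {"DEPL": 151, "STO": 38},
    "Network Failure": {"DEPL": 52, "NET": 30},
    "SSH Failure": {"DEPL": 41, "NET": 176},
    "Cleanup Failure": {"DEPL": 69, "STO": 157},
    "No Failure": {"DEPL": 539, "NET": 299, "STO": 386},
}

workloads = ("DEPL", "NET", "STO")
totals = {
    workload: sum(row.get(workload, 0) for row in ground_truth.values())
    for workload in workloads
}
print(totals)                # {'DEPL': 1076, 'NET': 561, 'STO': 901}
print(sum(totals.values()))  # 2538
```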

Clustering without Anomaly Detection
                              Workload
Clustering Approach           DEPL   NET    STO
k-medoids w/o fine-tuning     0.70   0.80   0.80
k-medoids with fine-tuning    0.74   0.85   0.82
DEC                           0.86   0.86   0.92
DEC achieves clusters with higher purity compared to traditional clustering, both without and with manual fine-tuning of feature weights
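Cluster purity, the metric named above, is commonly defined as the fraction of items that belong to the majority ground-truth class of their cluster. The sketch below implements that standard definition; whether the reported values use exactly this formula is an assumption, and the labels in the example are made up for illustration.

```python
from collections import Counter, defaultdict


def purity(true_labels, cluster_ids):
    """Fraction of items whose true label is the majority label of their cluster."""
    if not true_labels or len(true_labels) != len(cluster_ids):
        raise ValueError("inputs must be non-empty and of equal length")
    # For every cluster, count how often each ground-truth label occurs in it.
    members = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster][label] += 1
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(true_labels)


# Hypothetical example: six experiments, two clusters, one item misplaced in each.
true_labels = ["Instance", "Instance", "Volume", "Volume", "SSH", "SSH"]
cluster_ids = [0, 0, 0, 1, 1, 1]
score = purity(true_labels, cluster_ids)
print(round(score, 2))  # 0.67
```

Purity rewards homogeneous clusters but does not penalize splitting one failure mode across many clusters, so it is best read together with the number of clusters.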

Clustering with Anomaly Detection
                              Workload
Clustering Approach           DEPL   NET    STO
k-medoids w/o fine-tuning     0.80   0.78   0.87
k-medoids with fine-tuning    0.94   0.86   0.90
DEC                           0.84   0.83   0.89
DEC approaches the performance of manually-tuned clustering with anomaly detection

Failure Modes Distribution
[Figure: bar chart of the failure modes distribution (Instance Failure, Volume Failure, Network Failure, SSH Failure, Cleanup Failure, No Failure), comparing Ground Truth, k-medoids, k-medoids with fine-tuning, and DEC; vertical axis from 0 to 1800.]

Conclusion
▪ We presented a novel approach for analyzing failure data from cloud systems, by using unsupervised learning algorithms and deep learning
▪ We presented results on failure data from the popular OpenStack cloud computing platform
• The approach can achieve performance comparable to, or in some cases even better than, the performance of manually-tuned clustering
• The approach performs better than unsupervised clustering without feature engineering
