Enhancing the Analysis of Software Failures in
Cloud Computing Systems with Deep Learning
Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella
DIETI, Università degli Studi di Napoli Federico II, Italy
{cotroneo, luigi.desimone, pietro.liguori, roberto.natella}@unina.it
The 32nd International Symposium on Software Reliability Engineering
ISSRE, October 25 - 28, 2021 pietro.liguori@unina.it - 2
Cloud Computing Infrastructure
▪ Analyzing how faults can turn into service failures (Failure
Mode Analysis) is very difficult and time-consuming, even for
expert developers
• Huge volumes of data (hundreds of MBs, thousands of events)
• Large number of fault experiments
• High complexity, non-determinism
[Figure: IaaS cloud with clients issuing service requests and sys. admins. Faults: storage, network, software, etc. Failures: data loss, resource unavailable, etc. Output: failure data.]
Our case study: OpenStack
[Figure: OpenStack services (Nova, Horizon, Cinder, Neutron, Glance, Keystone, Swift) handling an instance creation request: auth-token validation, get image id, get IP address, volume attachment.]
Silent failures occur as omissions, delays, or out-of-order events in these workflows
Events in Fault-Injection Experiments
Contribution
▪ A novel approach for discovering the classes of
failure ("failure modes") of cloud computing systems,
using fault injection and deep learning
▪ Case study on a dataset of thousands of failures of
the OpenStack cloud computing platform
▪ The raw failure data (logs, event traces) are clustered into
a few failure modes (ease of interpretation by developers and
sysadmins)
Contribution (cont.)
▪ The failure dataset containing the events collected in
OpenStack during our fault-injection experiments is
publicly available on GitHub:
https://github.com/dessertlab/Failure-Dataset-OpenStack
▪ The paper is available on ScienceDirect:
https://doi.org/10.1016/j.jss.2021.111043
Failure Mode Analysis Based on Plain
Sequences of Events
[Figure: analysis pipeline. Instrumentation of the communication libraries (REST APIs, Message Queues, …) on each node; execution with fault injection; traces under fault-injected conditions; vector representation; clustering into clusters of failure modes (FAIL #1, FAIL #2, FAIL #3); visualization.]
Example: the trace AACABBA has the occurrence vector <A = 4, B = 2, C = 1>,
since the events A, B, C happened 4, 2 and 1 times, respectively, during
the failure
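The occurrence-vector example can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `occurrence_vector` and the fixed event ordering are assumptions made for illustration, and real traces would hold event identifiers (for example API call names) rather than single letters.

```python
from collections import Counter

def occurrence_vector(trace, event_types):
    """Count how often each event type occurs in a trace.

    `trace` is a sequence of event identifiers; `event_types` fixes the
    order of the vector components (hypothetical helper, for illustration).
    """
    counts = Counter(trace)
    return [counts[event] for event in event_types]

# The trace from the slide, one letter per event.
print(occurrence_vector(list("AACABBA"), ["A", "B", "C"]))  # [4, 2, 1]
```

Note that this representation discards the order of events: any permutation of the same trace yields the same vector.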
Failure Mode Analysis Based on
Anomaly Detection
[Figure: analysis pipeline. Instrumentation of the communication libraries (REST APIs, Message Queues, …) on each node; fault-free execution and execution with fault injection; traces under fault-free and fault-injected conditions; model training of normal behavior; anomaly detection; clustering into clusters of failure modes (FAIL #1, FAIL #2, FAIL #3); visualization.]
Example traces in the figure: AABBBBCA, AABBBABCC, AABBABBC (fault-free conditions); AACABBA (fault-injected conditions)
Anomaly vector:
< A = 1, B = 0, C = 1 (spurious anomalies),
A = 0, B = 2, C = 1 (missing anomalies) >
Cotroneo, Domenico, et al. "Enhancing failure propagation analysis in cloud computing
systems." 2019 IEEE 30th International Symposium on Software Reliability Engineering
(ISSRE). IEEE, 2019.
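The slide does not state how the model of normal behavior is built, so the following is only a sketch under a strong simplifying assumption: normal behavior is represented by a single fault-free reference trace, and anomalies are found by aligning the faulty trace against it with a longest common subsequence (LCS). Events of the faulty trace outside the LCS are counted as spurious, and events of the reference trace outside the LCS are counted as missing. The function names are hypothetical.

```python
from collections import Counter

def lcs(a, b):
    """Return one longest common subsequence of sequences a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover the subsequence.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out

def anomaly_vector(faulty, reference, event_types):
    """Spurious counts followed by missing counts, per event type."""
    common = Counter(lcs(faulty, reference))
    spurious = Counter(faulty) - common    # seen only in the faulty trace
    missing = Counter(reference) - common  # expected, but not observed
    return ([spurious[e] for e in event_types]
            + [missing[e] for e in event_types])

# Faulty trace AACABBA against the fault-free trace AABBBBCA.
print(anomaly_vector(list("AACABBA"), list("AABBBBCA"), ["A", "B", "C"]))
# [1, 0, 1, 0, 2, 1]
```

With these two traces the sketch gives the same anomaly vector as the slide, but a model trained on many fault-free traces would be needed to tolerate the non-determinism of normal executions.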
Proposed Solution:
Deep Embedded Clustering (DEC)
[Figure: analysis pipeline. Instrumentation of the communication libraries (REST APIs, Message Queues, …) on each node; execution with fault injection; traces under fault-injected conditions; vector representation; autoencoder (encoder, decoder, reconstruction error); clustering with a cluster layer on the embedded features from the encoder; clusters of failure modes (FAIL #1, FAIL #2, FAIL #3); visualization.]
This solution can also be used in
combination with anomaly detection, by
applying it to anomaly vectors
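The slide names the cluster layer but gives no formulas. As a hedged illustration, the sketch below follows the original DEC formulation (Xie et al., 2016), which may differ in detail from what the authors used: each embedded point gets a soft assignment q to every centroid through a Student's t kernel, a sharpened target distribution p is derived from q, and training minimizes KL(P || Q). The autoencoder is omitted; the embedded points and centroids are made-up toy values, and all function names are hypothetical.

```python
import math

def soft_assignments(embedded, centroids, alpha=1.0):
    """q[i][j]: Student's t similarity of point i to centroid j, row-normalized."""
    q = []
    for z in embedded:
        row = []
        for mu in centroids:
            dist2 = sum((a - b) ** 2 for a, b in zip(z, mu))
            row.append((1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])
    return q

def target_distribution(q):
    """p[i][j] proportional to q[i][j]^2 / f_j, with f_j the soft cluster size."""
    freq = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(len(row))]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_divergence(p, q):
    """Clustering loss KL(P || Q), summed over all points and clusters."""
    return sum(pi * math.log(pi / qi)
               for p_row, q_row in zip(p, q)
               for pi, qi in zip(p_row, q_row) if pi > 0)

# Toy 2-D embedded features (assumed, not from the dataset) and two centroids.
embedded = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]]
centroids = [[0.0, 0.0], [3.0, 3.0]]
q = soft_assignments(embedded, centroids)
p = target_distribution(q)
labels = [row.index(max(row)) for row in q]
print(labels, kl_divergence(p, q))
```

In a full implementation the gradient of this loss would update both the centroids and the encoder weights, which is what lets the embedding adapt to the clustering instead of relying on hand-tuned feature weights.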
Experiments
▪ 2,538 fault-injection experiments in OpenStack cloud
computing:
• 4 fault types
• 3 workloads (DEPL, NET, STO)
Ground truth (number of experiments per failure mode and workload):
Failure Mode       DEPL   NET   STO
Instance Failure    224    56   320
Volume Failure      151     -    38
Network Failure      52    30     -
SSH Failure          41   176     -
Cleanup Failure      69     -   157
No Failure          539   299   386
Clustering without Anomaly Detection
Clustering approach           Workload
                              DEPL   NET    STO
k-medoids w/o fine-tuning     0.70   0.80   0.80
k-medoids with fine-tuning    0.74   0.85   0.82
DEC                           0.86   0.86   0.92
DEC achieves clusters with higher purity
compared to traditional clustering, both without
and with manual fine-tuning of feature weights
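The slides do not define purity, so the sketch below uses the standard definition: each cluster is credited with its most frequent ground-truth class, and purity is the fraction of samples covered by those majority classes. The function name and the toy labels are illustrative assumptions and are not taken from the dataset.

```python
from collections import Counter, defaultdict

def purity(cluster_ids, true_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    # For each cluster, count the samples of its most common true label.
    majority_total = sum(Counter(labels).most_common(1)[0][1]
                         for labels in members.values())
    return majority_total / len(true_labels)

# Toy example (made up): 8 experiments grouped into 3 clusters.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
truth = ["Instance", "Instance", "Volume",
         "Network", "Network", "SSH",
         "SSH", "SSH"]
print(purity(clusters, truth))  # 0.75
```

Purity rises trivially as the number of clusters grows, so it is mainly meaningful when the compared approaches produce a similar number of clusters.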
Clustering with Anomaly Detection
Clustering approach           Workload
                              DEPL   NET    STO
k-medoids w/o fine-tuning     0.80   0.78   0.87
k-medoids with fine-tuning    0.94   0.86   0.90
DEC                           0.84   0.83   0.89
DEC approaches the performance of manually-
tuned clustering with anomaly detection
Failure Modes Distribution
[Figure: bar chart (y-axis from 0 to 1800) of the failure modes Instance Failure, Volume Failure, Network Failure, SSH Failure, Cleanup Failure, and No Failure, comparing Ground Truth, k-medoids, k-medoids with fine-tuning, and DEC.]
Conclusion
▪ We presented a novel approach for analyzing failure
data from cloud systems, by using unsupervised
learning algorithms and deep learning
▪ We presented results on failure data from the popular
OpenStack cloud computing platform
• The approach can achieve performance comparable to, or in
some cases even better than, the performance of manually-
tuned clustering
• The approach performs better than unsupervised clustering
w/o feature engineering
