1. Towards Runtime Verification via Event Stream Processing in Cloud Computing Infrastructures
Domenico Cotroneo, Luigi De Simone, Pietro Liguori,
Roberto Natella, and Angela Scibelli
DIETI, Università degli Studi di Napoli Federico II, Italy
{cotroneo, luigi.desimone, pietro.liguori, roberto.natella}@unina.it
ang.scibelli@studenti.unina.it
International Workshop on Artificial Intelligence for IT Operations
2. AIOPS, 14 December 2020 pietro.liguori@unina.it - 2
Problem: The fragility of cloud computing infrastructure software
Gunawi et al., 2016. “Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages”. In Proc. SoCC
3. Cloud Computing Infrastructure
Adopted in critical domains (telecom, healthcare, etc.)
Strict availability requirements ("five nines")
High complexity, non-determinism
At risk due to undetected failures (long MTTR, poor QoS, etc.)
[Figure: clients send service requests to the IaaS. Faults (storage, network, software, ...) lead to failures (data loss, resource unavailable, etc.), and sys. admins face a lack of failure notifications.]
4. Our case study: OpenStack
[Figure: OpenStack subsystems (Nova, Horizon, Cinder, Neutron, Glance, Keystone, Swift) cooperating on an instance creation request: auth-token validation, get image id, get IP address, volume attachment.]
Silent failures occur as omissions, delays, or out-of-order events in these workflows
5. Contribution
Generalizable approach for runtime detection of failures in cloud computing systems
• Black-box tracing
• Stream-based Runtime Verification
• Lightweight Monitoring Rules
Evaluation of the approach in OpenStack
• Fault-injection campaign (481 experiments with a failure)
• Intensive workload stressing the three most important OpenStack subsystems (Nova, Cinder, Neutron)
• Evaluation of the monitoring rules, in terms of Failure Detection Coverage (FDC), in both single-user and multi-user scenarios
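The slides do not spell out how FDC is computed. A minimal sketch, assuming FDC is the percentage of failure experiments in which at least one monitoring rule raised an alert; the function name and the sample outcomes are hypothetical and are not data from the talk:

```python
def failure_detection_coverage(detected_flags):
    """Return FDC in percent, given one boolean per failure experiment.

    Assumption: an experiment counts as "detected" if any rule fired.
    """
    flags = list(detected_flags)
    if not flags:
        raise ValueError("at least one failure experiment is required")
    return 100.0 * sum(flags) / len(flags)


# Hypothetical outcomes of four failure experiments (not from the talk).
outcomes = [True, True, False, True]
print(failure_detection_coverage(outcomes))
```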
6. Ideal Tracing
Invariants: properties that hold over events in an execution
• E.g., "buy" from client must be preceded by "available" from server
Difficult to apply in practice
• Needs the happened-before relation between events, using vector clocks or by propagating a session ID
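[Figure: events A and B of two requests, tagged with session IDs (A(id=1), B(id=1), A(id=2), B(id=2), …), interleaved on a timeline as A1, A2, B2, B1.]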
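With a propagated session ID, the "buy must be preceded by available" invariant can be checked per session. A minimal sketch, assuming each event arrives as a hypothetical (session_id, name) pair in observation order:

```python
def check_precedence(events):
    """Return session IDs where "buy" was seen without an earlier "available".

    events: iterable of (session_id, name) pairs in observed order.
    The pair layout is an assumption made for this illustration.
    """
    seen_available = set()
    violations = []
    for session_id, name in events:
        if name == "available":
            seen_available.add(session_id)
        elif name == "buy" and session_id not in seen_available:
            violations.append(session_id)
    return violations


# Two interleaved sessions; session 2 buys before the server says "available".
trace = [(1, "available"), (2, "buy"), (1, "buy"), (2, "available")]
print(check_precedence(trace))  # [2]
```

The check is trivial only because every event carries its session ID, which is the information that is hard to obtain in practice.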
7. Black box tracing
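[Figure: User 1 and User 2 send requests to the OpenStack subsystems (Nova, Horizon, Cinder, Neutron, Glance, Keystone, Swift). Events are captured at the communication APIs (REST APIs, Message Queues) and appear on the timeline as A, A, B, B.]
Ordering rule: ∀ a ∈ A ⇒ ∃ b ∈ B: a → b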
8. Black box tracing (cont.)
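[Figure: the OpenStack subsystems (Nova, Horizon, Cinder, Neutron, Glance, Keystone, Swift) with events captured at the communication APIs (REST APIs, Message Queues). The timeline shows one event A followed by three events C: A, C, C, C.]
Counting rule: |C| < maxCount(C)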
9. Approach Overview
[Figure: approach overview in five numbered steps. Nodes 1, 2 and 3 exchange messages through the communication APIs (REST APIs, MQs). Labels: Instrumentation; Collection of correct executions; Fault-free traces; Analysis; Lightweight Monitoring Rules; Monitor Synthesis; Stream Processor (RV Process). Example event stream: A, A, B, B.]
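The slides do not say how thresholds such as maxCount(C) are obtained from the fault-free traces. One possible sketch, assuming (hypothetically) that each fault-free trace is a list of event names and that the threshold is the largest per-trace count observed:

```python
from collections import Counter


def learn_max_counts(fault_free_traces):
    """For each event name, return the largest number of occurrences
    seen in any single fault-free trace.

    This derivation is an assumption for illustration, not the
    procedure described in the talk.
    """
    max_counts = {}
    for trace in fault_free_traces:
        for name, count in Counter(trace).items():
            max_counts[name] = max(max_counts.get(name, 0), count)
    return max_counts


# Hypothetical fault-free traces.
traces = [["A", "C", "C"], ["A", "C", "C", "C"]]
print(learn_max_counts(traces))  # {'A': 1, 'C': 3}
```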
10. Ordering-based Rules
Add_Volume1: event A (name="compute_reserve_block_device_name") is eventually followed by event B (name="compute_attach_volume")
Add_Volume2: event A (name="compute_attach_volume") is eventually followed by event B (name="cinder-volume.localhost.localdomain@lvm_initialize_connection")
Add_Volume3: event sequence of Rule Add_Volume2 is eventually followed by event C (name="cinder-volume.localhost.localdomain@lvm_attach_volume")
Rule          FDC %
Add_Volume1   26.67
Add_Volume2   11.66
Add_Volume3   51.67
Total         90.00
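An "A is eventually followed by B" rule can be monitored on a stream without session IDs by pairing each B with the oldest unmatched A. A minimal sketch under stated assumptions: events arrive as (timestamp, name) pairs, and "eventually" is bounded by a timeout, which the slides do not specify. The class name and the 5-second value are hypothetical.

```python
from collections import deque


class FollowedByRule:
    """Alert when `first` is not followed by `then` within `timeout` seconds."""

    def __init__(self, name, first, then, timeout):
        self.name = name
        self.first = first
        self.then = then
        self.timeout = timeout
        self.pending = deque()  # timestamps of unmatched `first` events

    def observe(self, timestamp, event_name):
        """Feed one event and return the alerts it triggers (possibly none)."""
        alerts = []
        # Expire every pending `first` event that has waited too long.
        while self.pending and timestamp - self.pending[0] > self.timeout:
            started = self.pending.popleft()
            alerts.append(f"{self.name}: no {self.then} after "
                          f"{self.first} at t={started}")
        if event_name == self.first:
            self.pending.append(timestamp)
        elif event_name == self.then and self.pending:
            self.pending.popleft()  # match the oldest unmatched `first`
        return alerts


rule = FollowedByRule("Add_Volume1",
                      first="compute_reserve_block_device_name",
                      then="compute_attach_volume",
                      timeout=5.0)  # hypothetical timeout
stream = [(0.0, "compute_reserve_block_device_name"),
          (1.0, "compute_attach_volume"),
          (2.0, "compute_reserve_block_device_name"),
          (9.0, "some_other_event")]
alerts = [a for t, e in stream for a in rule.observe(t, e)]
print(len(alerts))  # 1
```

Here the second reservation is never followed by an attach, so the alert fires when the next event arrives after the timeout. A real monitor would also need a timer to expire pending events when the stream goes silent.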
11. Counting-based Rules
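[Figure: flowchart. Events from the system under test are filtered by type (type = C?). If YES, the events of type C are counted and checked against |C| > maxCount(C)?. If YES, a Failure Detection Message is raised.]
Rule           FDC %
SSH_Failure1   27.07
SSH_Failure2   15.38
SSH_Failure3    7.69
Total          38.46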
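The flowchart can be sketched as a small streaming monitor. This is an illustration under assumptions: the thresholds are passed in as a dictionary, counts accumulate over one monitored run, and the event name "C" stands in for the real event types, which the slides do not list.

```python
from collections import Counter


class CountingRule:
    """Raise a failure message when an event type exceeds its maxCount."""

    def __init__(self, max_counts):
        self.max_counts = max_counts  # event name -> allowed maximum
        self.counts = Counter()

    def observe(self, event_name):
        """Feed one event; return a message on violation, else None."""
        if event_name not in self.max_counts:
            return None  # "type = C?" answered NO: ignore the event
        self.counts[event_name] += 1
        seen = self.counts[event_name]
        limit = self.max_counts[event_name]
        if seen > limit:  # "|C| > maxCount(C)?" answered YES
            return f"failure detected: {event_name} seen {seen} times (max {limit})"
        return None


monitor = CountingRule({"C": 3})  # hypothetical threshold
stream = ["A", "C", "C", "C", "C"]
messages = [m for m in map(monitor.observe, stream) if m is not None]
print(messages)
```

Only the fourth C produces a message, since the first three stay within the threshold.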
12. Comparison with API Errors Coverage
Target System   Failure Type             OpenStack FDC %   RV FDC %
Cinder          Volume Creation Fail     29.67             28.57
Cinder          Volume Attachment Fail   25.33             92.00
Cinder          Volume Deletion Fail     100               100
Nova            Instance Creation Fail   0.00              90.96
Neutron         SSH Connection Fail      0.00              38.46
OpenStack       Total                    23.96             79.38
14. Conclusion
Generalizable approach for runtime detection of failures in cloud computing systems
• Portable, low intrusiveness
Evaluation of the approach in OpenStack
• Definition of lightweight failure detection rules for Nova, Cinder and Neutron subsystems
• RV failure detection coverage is much higher than OpenStack API error coverage (79.38% vs. 23.96% overall)
Future work
• Algorithm to identify patterns using statistical analysis techniques
• Evaluation in a real multi-user scenario