How Bad Can a Bug Get?
An Empirical Analysis of Software Failures
in the OpenStack Cloud Computing Platform
Domenico Cotroneo*, Luigi De Simone*, Pietro Liguori*,
Roberto Natella*, Nematollah Bidokhti**
*DIETI, Università degli Studi di Napoli Federico II, Italy
**Futurewei Technologies, Inc., USA
*{cotroneo, luigi.desimone, pietro.liguori, roberto.natella}@unina.it **nbidokht@futurewei.com
ESEC/FSE 2019, Tallinn, Estonia, 26-30 August, 2019
Problem: The fragility of cloud
computing infrastructure software
Gunawi et al., 2016. “Why Does the Cloud Stop Computing?
Lessons from Hundreds of Service Outages”. In Proc. SoCC
Our case study: OpenStack
Nova, Horizon, Cinder, Neutron, Glance, Keystone, Swift
Example: instance creation request
1. Failure notified by a timely API error (fail-stop)
2. Log messages with CRITICAL or ERROR severity
   2019-08-27 15:13:20.106 ERROR nova.api.openstack.extensions Unexpected exception in API method …
3. Failure is isolated
Contribution
• Empirical analysis of high-severity failures in the OpenStack cloud computing platform:
  - RQ1: Are failures actually “fail-stop”?
  - RQ2: Are failures logged?
  - RQ3: Are failures propagated across sub-systems?
• Artifacts for reproducing our experimental environment in a virtual machine:
  - DOI: 10.6084/m9.figshare.8242877
Fault Injection Methodology
Workload logs
• API Errors
  - openstack instance create
• Assertion (Healthy) Checks
  - Network Status: Active
  - Instance Status: Error
OpenStack sub-system logs
2019-08-27 15:13:20.106 ERROR nova.api.openstack.extensions Unexpected exception in API method …
# ~/nova/compute/api.py
# ORIGINAL CODE
# self.compute_task_api.schedule_and_build_instances(instanceID, build_parameters)
# BUGGY CODE (missing parameter)
self.compute_task_api.schedule_and_build_instances(instanceID)
Workload
Overview of a fault injection experiment
Original Python code:
iface_name = self.get_interface_name(network, port)
Injected Python code:
if bug_trigger:
    # BUGGY CODE (FAULTY ROUND)
    # Missing Parameter (MP)
    iface_name = self.get_interface_name(network)
else:
    # CORRECT CODE (FAULT-FREE ROUND)
    iface_name = self.get_interface_name(network, port)
Timeline: faulty round (fault ON), fault-free round (fault OFF), clean-up
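The two rounds can be illustrated with a minimal, self-contained sketch. It assumes the trigger is a module-level flag; `FakeDriver`, `plug`, and `run_round` are hypothetical names used only for illustration and are not taken from the authors' tooling.

```python
"""Sketch: one injected fault, toggled between a faulty and a fault-free round."""

# Hypothetical global trigger; the slide only shows a variable named bug_trigger.
bug_trigger = False


class FakeDriver:
    """Stand-in for the OpenStack class that owns get_interface_name."""

    def get_interface_name(self, network, port):
        return f"tap-{network}-{port}"

    def plug(self, network, port):
        if bug_trigger:
            # BUGGY CODE (faulty round): Missing Parameter (MP)
            return self.get_interface_name(network)
        # CORRECT CODE (fault-free round)
        return self.get_interface_name(network, port)


def run_round(fault_enabled):
    """Run one round of a toy workload and report its outcome."""
    global bug_trigger
    bug_trigger = fault_enabled
    try:
        FakeDriver().plug("net0", "port0")
        return "ok"
    except TypeError as exc:
        # In Python, a missing positional argument surfaces as a TypeError.
        return f"failed: {type(exc).__name__}"
    finally:
        bug_trigger = False  # always leave the fault disabled afterwards


faulty_result = run_round(fault_enabled=True)
fault_free_result = run_round(fault_enabled=False)
```

Keeping both versions of the statement in the same file means the same deployment can serve the faulty and the fault-free round without redeploying code.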
Which bugs should we inject?
We went through problem reports on Launchpad to identify recurring bug-fixing changes in OpenStack
[Bar chart: number of bug fixes (0 to 25) per fault type: API, DICT, SQL, RPC, SYSTEM, AGENT/PLUGIN]
--- nova/virt/libvirt_conn.py 2011-01-25 12:44:26 +0000
+++ nova/virt/libvirt_conn.py 2011-01-25 20:42:26 +0000
@@ -1268,13 +1268,13 @@
if(ip_version == 4):
# Allow DHCP responses
dhcp_server = self._dhcp_server_for_instance(instance)
- our_rules += ['-A %s -s %s -p udp --sport 67 --dport 68' %
- (chain_name, dhcp_server)]
+ our_rules += ['-A %s -s %s -p udp --sport 67 --dport 68 '
+ '-j ACCEPT ' % (chain_name, dhcp_server)]
elif(ip_version == 6):
Fault type   Sub-system
             Nova   Cinder   Neutron   ALL
MFC          110    55       36        201
WPV          60     40       36        136
MP           57     38       36        131
WRV          149    96       59        304
TE           63     40       36        139
ALL          439    269      203       911
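As an illustration of how a fault such as MP (Missing Parameter) could be applied mechanically to Python source, the sketch below uses the standard `ast` module to drop the last positional argument of a chosen method call. This is an assumption about one possible approach, not a description of the authors' injection tool; `MissingParameter` and `inject_missing_parameter` are hypothetical names, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


class MissingParameter(ast.NodeTransformer):
    """Drop the last positional argument of calls to a target method."""

    def __init__(self, method_name):
        self.method_name = method_name
        self.mutations = 0

    def visit_Call(self, node):
        self.generic_visit(node)  # also mutate nested calls
        func = node.func
        if isinstance(func, ast.Attribute):
            name = func.attr          # e.g. self.get_interface_name(...)
        else:
            name = getattr(func, "id", None)  # e.g. get_interface_name(...)
        if name == self.method_name and node.args:
            node.args = node.args[:-1]
            self.mutations += 1
        return node


def inject_missing_parameter(source, method_name):
    """Return (mutated source, number of mutated call sites)."""
    tree = ast.parse(source)
    mutator = MissingParameter(method_name)
    tree = mutator.visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree), mutator.mutations


original = "iface_name = self.get_interface_name(network, port)"
mutated, count = inject_missing_parameter(original, "get_interface_name")
print(mutated)
```

Working on the syntax tree rather than on text keeps the mutated file syntactically valid, so any resulting failure comes from the injected bug and not from a broken file.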
Fail-stop Behavior
Workload timeline (operations shown on the slide): Add Role, Create Keypair, Create Security Group, Create Router, Create Network, Create Instance, Create Floating IP, Create Volume, Reboot Instance, Create Image, Create Domain, Create Project, Create User, Create Subnetwork, Set Gateway, Add Floating IP to Instance, Attach Volume to Instance, Cleanup Resources
API Error: openstack instance create
When an API call generates an error, the workload is aborted
Assertion checks on the status of the virtual resources (e.g., Network Status: Active)
Non Fail-stop Behavior
Instance Status: Error, but no API error!
Failure delay (the API error is raised only by a later call)
API Error: Cannot 'attach_volume' instanceID while it is in vm_state error
The workload continues its execution regardless of the assertion check(s)
Workload timeline: same operations as on the Fail-stop Behavior slide
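The workload behavior on the last two slides (abort on an API error, keep going after a failed assertion check) can be sketched as a small runner. Everything here is hypothetical scaffolding: `ApiError`, `run_workload`, the toy steps, and the check named "Instance Status: Active" are illustrative assumptions, not the authors' workload generator.

```python
class ApiError(Exception):
    """Raised by a workload step when the API call returns an error."""


def run_workload(steps, checks):
    """Run steps in order and stop at the first API error.

    steps:  list of (name, callable); each callable performs one API call.
    checks: list of (name, callable); each callable returns True if healthy.
    """
    report = {"api_error": None, "assertion_failures": []}
    for step_name, step in steps:
        try:
            step()
        except ApiError as exc:
            report["api_error"] = (step_name, str(exc))
            break  # the workload is aborted on an API error
        # Failed assertion checks are recorded, but the workload continues.
        for check_name, check in checks:
            if not check():
                report["assertion_failures"].append((step_name, check_name))
    return report


# Toy scenario: the instance silently ends up in ERROR, and the API error
# only shows up later, when a volume is attached to it.
state = {"network": None, "instance": None}


def create_network():
    state["network"] = "ACTIVE"


def create_instance():
    state["instance"] = "ERROR"  # silent failure: no API error is raised


def attach_volume():
    raise ApiError("Cannot 'attach_volume' instance while it is in vm_state error")


steps = [
    ("Create Network", create_network),
    ("Create Instance", create_instance),
    ("Attach Volume to Instance", attach_volume),
]
checks = [
    ("Network Status: Active", lambda: state["network"] in (None, "ACTIVE")),
    ("Instance Status: Active", lambda: state["instance"] in (None, "ACTIVE")),
]
report = run_workload(steps, checks)
```

In this toy run the report holds one assertion failure (recorded right after Create Instance) followed by an API error at Attach Volume to Instance, which is the delayed, non fail-stop pattern.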
RQ1: Does OpenStack Show a Fail-Stop Behavior?
[Bar chart: percentage of experiments per failure type (API Error Only; Assertion Failure(s) & API Error; Assertion Failure(s) Only) for Nova, Cinder, Neutron, and all sub-systems. Bar labels as extracted: 40%, 37%, 23%; 35%, 46%, 18%; 60%, 32%, 7%; 44%, 38%, 18%]
Fail-stop: API Error Only (failures notified by a timely API error)
Non fail-stop: Assertion Failure(s) & API Error (failures that were notified with a delay); Assertion Failure(s) Only (failures with no API error, but virtual resources are in an incorrect state)
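The three failure types can be expressed as a small classification rule. The sketch below is an assumption about how an experiment outcome could be mapped to the categories named on the slide; the function names and the input shape (an API error plus a list of failed assertion checks) are hypothetical.

```python
from collections import Counter


def classify_failure(api_error, assertion_failures):
    """Map one experiment outcome to a failure type from the slide."""
    if api_error and assertion_failures:
        return "Assertion Failure(s) & API Error"  # notified with a delay
    if api_error:
        return "API Error Only"                    # timely API error
    if assertion_failures:
        return "Assertion Failure(s) Only"         # never notified
    return "No Failure"


def is_fail_stop(failure_type):
    """Only a timely API error counts as fail-stop behavior."""
    return failure_type == "API Error Only"


def failure_shares(outcomes):
    """Percentage of each failure type among the failed experiments.

    outcomes: iterable of (api_error, assertion_failures) pairs.
    """
    counts = Counter(classify_failure(err, fails) for err, fails in outcomes)
    counts.pop("No Failure", None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: 100.0 * n / total for kind, n in counts.items()}


# Made-up outcomes, for illustration only (not the study's data).
outcomes = [
    ("HTTP 500", []),
    ("HTTP 409", ["Instance Status"]),
    (None, ["Instance Status"]),
    ("HTTP 500", []),
    (None, []),
]
shares = failure_shares(outcomes)
```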
RQ1: Does OpenStack Show a Fail-Stop Behavior?
Failure type                                                  Subsystem   Median latency [s]
Assertion Failure(s) followed by API Error (non fail-stop)    Nova        152.25
                                                              Cinder      74.52
                                                              Neutron     144.72
API Error Only (fail-stop)                                    Nova        3.73
                                                              Cinder      0.30
                                                              Neutron     0.30
Long API error latency (2 minutes on average)
[Plot: probability (0 to 1) vs. time (s), 0 to 400, with curves for Nova, Neutron, and Cinder]
RQ2: Is OpenStack Able to Log Failures?
• In 8.5% of experiments, there were no log messages with CRITICAL or ERROR severity
Logging coverage
Subsystem   API Errors Only   Assertion Failure(s) and API Errors   Assertion Failure(s) Only
Nova        90.32%            82.56%                                80.77%
Cinder      100%              100%                                  95.65%
Neutron     98.67%            95%                                   66.67%
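A logging-coverage figure of this kind can be computed by scanning each failed experiment's logs for high-severity messages. The sketch below assumes log lines shaped like the example shown earlier in the slides (timestamp, severity, logger name); the optional process-ID field in the pattern and the names `has_high_severity` and `logging_coverage` are assumptions made for illustration.

```python
import re

# Matches lines such as:
# 2019-08-27 15:13:20.106 ERROR nova.api.openstack.extensions Unexpected exception ...
# The optional numeric field after the timestamp (a process ID) is an assumption.
LOG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} "
    r"(?:\d+ )?(?P<severity>[A-Z]+) (?P<logger>\S+)"
)
HIGH_SEVERITY = {"CRITICAL", "ERROR"}


def has_high_severity(log_lines):
    """True if at least one line has CRITICAL or ERROR severity."""
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group("severity") in HIGH_SEVERITY:
            return True
    return False


def logging_coverage(failed_experiments):
    """Percentage of failed experiments with a high-severity log message.

    failed_experiments: list of log-line lists, one per failed experiment.
    """
    if not failed_experiments:
        return 0.0
    logged = sum(has_high_severity(lines) for lines in failed_experiments)
    return 100.0 * logged / len(failed_experiments)


# Made-up logs, for illustration only.
experiment_a = [
    "2019-08-27 15:13:19.001 INFO nova.compute.manager Starting instance",
    "2019-08-27 15:13:20.106 ERROR nova.api.openstack.extensions "
    "Unexpected exception in API method",
]
experiment_b = [
    "2019-08-27 15:14:02.500 INFO neutron.wsgi Accepted connection",
    "2019-08-27 15:14:03.250 WARNING neutron.agent Port not yet bound",
]
coverage = logging_coverage([experiment_a, experiment_b])
```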
RQ3: Do Failures Propagate Across OpenStack?
Faulty Round
[Graph: failure propagation from injections in Nova, Cinder, and Neutron to the observed failures: Failure SSH, Failure Instance Active, Failure Volume Attached, Failure Volume Created, Nova API Error, Cinder API Error, Neutron API Error]
Failures propagate across OpenStack services in a significant number of cases (37.5% of the failures)
RQ3: Do Failures Propagate Across OpenStack?
Fault-Free Round (after fault removal)
[Graph: failures still observed after injections in Nova, Cinder, and Neutron: Failure SSH, Failure Instance Active, Failure Volume Attached, Failure Volume Created, Nova API Error, Cinder API Error, Neutron API Error]
Persistent Failures
Even after we disable the fault (fault-free round), OpenStack still experiences failures (7.5% of the cases).
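Both measures can be sketched over a simple record per experiment. The definitions used here are assumptions made for illustration: a failure counts as propagated when it is observed in a sub-system other than the injected one, and as persistent when anything still fails in the fault-free round. `Experiment`, `propagated`, `persistent`, and `percentage` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    injected_in: str
    # Sub-systems where failures were observed with the fault ON.
    faulty_round: set = field(default_factory=set)
    # Sub-systems where failures were observed with the fault OFF.
    fault_free_round: set = field(default_factory=set)


def propagated(exp):
    """Assumed definition: a failure shows up outside the injected sub-system."""
    return any(subsystem != exp.injected_in for subsystem in exp.faulty_round)


def persistent(exp):
    """Assumed definition: something still fails after the fault is disabled."""
    return bool(exp.fault_free_round)


def percentage(experiments, predicate):
    """Share of failed experiments (fault ON) that satisfy the predicate."""
    failed = [exp for exp in experiments if exp.faulty_round]
    if not failed:
        return 0.0
    return 100.0 * sum(predicate(exp) for exp in failed) / len(failed)


# Made-up experiments, for illustration only (not the study's data).
experiments = [
    Experiment("Neutron", {"Neutron", "Nova"}),
    Experiment("Nova", {"Nova"}),
    Experiment("Cinder", {"Cinder"}, {"Cinder"}),
    Experiment("Nova", {"Cinder"}),
]
propagation_rate = percentage(experiments, propagated)
persistence_rate = percentage(experiments, persistent)
```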
Conclusion (Answers) (1/2)
• RQ1: Are failures actually “fail-stop”?
  - Answer: In the majority of the cases, OpenStack does not behave in a “fail-stop” way (late or no API error)
  - Suggestions: Mitigate failures by actively checking the status of virtual resources, as in our assertion checks (e.g., checks incorporated in a monitoring solution)
• RQ2: Are failures logged?
  - Answer: In a small fraction of the experiments, there was no indication of the failure in the logs
  - Suggestions: Improve logging in the source code (e.g., by checking for errors returned by the faulty function calls)
Conclusion (Answers) (2/2)
• RQ3: Are failures propagated across sub-systems?
  - Answer: In a significant share of the failures (37.5%), the injected bugs propagated across several OpenStack sub-systems. There were also relevant cases of failures that caused subtle residual effects on OpenStack
  - Suggestions: Improve resource clean-up on errors, to prevent propagation across service API calls and across sub-systems
Use our artifact to support future research on mitigating the impact of software bugs (DOI: 10.6084/m9.figshare.8242877)
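The clean-up suggestion can be illustrated with a rollback pattern from the Python standard library: each resource registers its own undo action as soon as it is created, so a failure partway through leaves nothing behind. This is a generic sketch, not OpenStack code; `provision` and the toy create/delete callables are hypothetical.

```python
from contextlib import ExitStack


def provision(create_steps):
    """Create resources in order; undo all of them if any step fails.

    create_steps: list of (create, delete) callables. create() returns a
    resource handle and delete(handle) removes that resource.
    """
    with ExitStack() as stack:
        handles = []
        for create, delete in create_steps:
            handle = create()
            # Registered only after the resource exists; on an exception the
            # callbacks run in reverse order of registration.
            stack.callback(delete, handle)
            handles.append(handle)
        stack.pop_all()  # success: keep the resources, cancel the rollback
        return handles


# Toy resources tracked in a list, for illustration only.
live = []


def make(name):
    def create():
        live.append(name)
        return name
    return create


def delete(handle):
    live.remove(handle)


def failing_create():
    raise RuntimeError("volume creation failed")


created = provision([(make("network"), delete), (make("instance"), delete)])
live_after_success = list(live)

live.clear()
try:
    provision([(make("network"), delete), (failing_create, delete)])
    rolled_back = False
except RuntimeError:
    rolled_back = (live == [])
```

The error still reaches the caller, so the failure is reported, but no half-created resources remain to affect later API calls.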

More Related Content

What's hot

Under-reported Security Defects in Kubernetes Manifests
Under-reported Security Defects in Kubernetes ManifestsUnder-reported Security Defects in Kubernetes Manifests
Under-reported Security Defects in Kubernetes Manifests
Akond Rahman
 
OSIS18_IoT : Ada and SPARK - Defense in Depth for Safe Micro-controller Progr...
OSIS18_IoT : Ada and SPARK - Defense in Depth for Safe Micro-controller Progr...OSIS18_IoT : Ada and SPARK - Defense in Depth for Safe Micro-controller Progr...
OSIS18_IoT : Ada and SPARK - Defense in Depth for Safe Micro-controller Progr...
Pôle Systematic Paris-Region
 
Predicting bugs using antipatterns
Predicting bugs using antipatternsPredicting bugs using antipatterns
Predicting bugs using antipatterns
Foutse Khomh
 
Automated Program Repair Keynote talk
Automated Program Repair Keynote talkAutomated Program Repair Keynote talk
Automated Program Repair Keynote talk
Abhik Roychoudhury
 
What Questions Do Programmers Ask About Configuration as Code?
What Questions Do Programmers Ask About Configuration as Code?What Questions Do Programmers Ask About Configuration as Code?
What Questions Do Programmers Ask About Configuration as Code?
Akond Rahman
 
SFScon 21 - Luigi Gubello - Security metrics for open-source projects
SFScon 21 - Luigi Gubello - Security metrics for open-source projectsSFScon 21 - Luigi Gubello - Security metrics for open-source projects
SFScon 21 - Luigi Gubello - Security metrics for open-source projects
South Tyrol Free Software Conference
 
Qualifying exam-2015-final
Qualifying exam-2015-finalQualifying exam-2015-final
Qualifying exam-2015-final
Open Networking Perú (Opennetsoft)
 
HSC-IoT: A Hardware and Software Co-Verification based Authentication Scheme ...
HSC-IoT: A Hardware and Software Co-Verification based Authentication Scheme ...HSC-IoT: A Hardware and Software Co-Verification based Authentication Scheme ...
HSC-IoT: A Hardware and Software Co-Verification based Authentication Scheme ...
Mahmud Hossain
 
AI & ML in Cyber Security - Why Algorithms are Dangerous
AI & ML in Cyber Security - Why Algorithms are DangerousAI & ML in Cyber Security - Why Algorithms are Dangerous
AI & ML in Cyber Security - Why Algorithms are Dangerous
Raffael Marty
 
Shhh!: Secret Management Practices for Infrastructure as Code
Shhh!: Secret Management Practices for Infrastructure as Code Shhh!: Secret Management Practices for Infrastructure as Code
Shhh!: Secret Management Practices for Infrastructure as Code
Akond Rahman
 
Alexandre Borges - Advanced Malware: rootkits, .NET and BIOS/UEFI threats - D...
Alexandre Borges - Advanced Malware: rootkits, .NET and BIOS/UEFI threats - D...Alexandre Borges - Advanced Malware: rootkits, .NET and BIOS/UEFI threats - D...
Alexandre Borges - Advanced Malware: rootkits, .NET and BIOS/UEFI threats - D...
DC2711 - DEF CON GROUP - Johannesburg
 
JavaSecure
JavaSecureJavaSecure
JavaSecure
SangbeomKim
 
The Finest Penetration Testing Framework for Software-Defined Networks
The Finest Penetration Testing Framework for Software-Defined NetworksThe Finest Penetration Testing Framework for Software-Defined Networks
The Finest Penetration Testing Framework for Software-Defined Networks
Priyanka Aash
 
ACSAC2020 "Return-Oriented IoT" by Kuniyasu Suzaki
ACSAC2020 "Return-Oriented IoT" by Kuniyasu SuzakiACSAC2020 "Return-Oriented IoT" by Kuniyasu Suzaki
ACSAC2020 "Return-Oriented IoT" by Kuniyasu Suzaki
Kuniyasu Suzaki
 
AI for Cybersecurity Innovation
AI for Cybersecurity InnovationAI for Cybersecurity Innovation
AI for Cybersecurity Innovation
Pete Burnap
 
Key Updating for Leakage Resiliency with Application to AES Modes of Operation
Key Updating for Leakage Resiliency with Application to AES Modes of OperationKey Updating for Leakage Resiliency with Application to AES Modes of Operation
Key Updating for Leakage Resiliency with Application to AES Modes of Operation
1crore projects
 

What's hot (16)

Under-reported Security Defects in Kubernetes Manifests
Under-reported Security Defects in Kubernetes ManifestsUnder-reported Security Defects in Kubernetes Manifests
Under-reported Security Defects in Kubernetes Manifests
 
OSIS18_IoT : Ada and SPARK - Defense in Depth for Safe Micro-controller Progr...
OSIS18_IoT : Ada and SPARK - Defense in Depth for Safe Micro-controller Progr...OSIS18_IoT : Ada and SPARK - Defense in Depth for Safe Micro-controller Progr...
OSIS18_IoT : Ada and SPARK - Defense in Depth for Safe Micro-controller Progr...
 
Predicting bugs using antipatterns
Predicting bugs using antipatternsPredicting bugs using antipatterns
Predicting bugs using antipatterns
 
Automated Program Repair Keynote talk
Automated Program Repair Keynote talkAutomated Program Repair Keynote talk
Automated Program Repair Keynote talk
 
What Questions Do Programmers Ask About Configuration as Code?
What Questions Do Programmers Ask About Configuration as Code?What Questions Do Programmers Ask About Configuration as Code?
What Questions Do Programmers Ask About Configuration as Code?
 
SFScon 21 - Luigi Gubello - Security metrics for open-source projects
SFScon 21 - Luigi Gubello - Security metrics for open-source projectsSFScon 21 - Luigi Gubello - Security metrics for open-source projects
SFScon 21 - Luigi Gubello - Security metrics for open-source projects
 
Qualifying exam-2015-final
Qualifying exam-2015-finalQualifying exam-2015-final
Qualifying exam-2015-final
 
HSC-IoT: A Hardware and Software Co-Verification based Authentication Scheme ...
HSC-IoT: A Hardware and Software Co-Verification based Authentication Scheme ...HSC-IoT: A Hardware and Software Co-Verification based Authentication Scheme ...
HSC-IoT: A Hardware and Software Co-Verification based Authentication Scheme ...
 
AI & ML in Cyber Security - Why Algorithms are Dangerous
AI & ML in Cyber Security - Why Algorithms are DangerousAI & ML in Cyber Security - Why Algorithms are Dangerous
AI & ML in Cyber Security - Why Algorithms are Dangerous
 
Shhh!: Secret Management Practices for Infrastructure as Code
Shhh!: Secret Management Practices for Infrastructure as Code Shhh!: Secret Management Practices for Infrastructure as Code
Shhh!: Secret Management Practices for Infrastructure as Code
 
Alexandre Borges - Advanced Malware: rootkits, .NET and BIOS/UEFI threats - D...
Alexandre Borges - Advanced Malware: rootkits, .NET and BIOS/UEFI threats - D...Alexandre Borges - Advanced Malware: rootkits, .NET and BIOS/UEFI threats - D...
Alexandre Borges - Advanced Malware: rootkits, .NET and BIOS/UEFI threats - D...
 
JavaSecure
JavaSecureJavaSecure
JavaSecure
 
The Finest Penetration Testing Framework for Software-Defined Networks
The Finest Penetration Testing Framework for Software-Defined NetworksThe Finest Penetration Testing Framework for Software-Defined Networks
The Finest Penetration Testing Framework for Software-Defined Networks
 
ACSAC2020 "Return-Oriented IoT" by Kuniyasu Suzaki
ACSAC2020 "Return-Oriented IoT" by Kuniyasu SuzakiACSAC2020 "Return-Oriented IoT" by Kuniyasu Suzaki
ACSAC2020 "Return-Oriented IoT" by Kuniyasu Suzaki
 
AI for Cybersecurity Innovation
AI for Cybersecurity InnovationAI for Cybersecurity Innovation
AI for Cybersecurity Innovation
 
Key Updating for Leakage Resiliency with Application to AES Modes of Operation
Key Updating for Leakage Resiliency with Application to AES Modes of OperationKey Updating for Leakage Resiliency with Application to AES Modes of Operation
Key Updating for Leakage Resiliency with Application to AES Modes of Operation
 

Similar to Slide presentation of "How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform"

Neutron Extension API
Neutron Extension APINeutron Extension API
IRJET- Web Application Firewall: Artificial Intelligence ARC
IRJET-  	  Web Application Firewall: Artificial Intelligence ARCIRJET-  	  Web Application Firewall: Artificial Intelligence ARC
IRJET- Web Application Firewall: Artificial Intelligence ARC
IRJET Journal
 
Dependability Benchmarking by Injecting Software Bugs
Dependability Benchmarking by Injecting Software BugsDependability Benchmarking by Injecting Software Bugs
Dependability Benchmarking by Injecting Software Bugs
Roberto Natella
 
IRJET- Windows Log Investigator System for Faster Root Cause Detection of a D...
IRJET- Windows Log Investigator System for Faster Root Cause Detection of a D...IRJET- Windows Log Investigator System for Faster Root Cause Detection of a D...
IRJET- Windows Log Investigator System for Faster Root Cause Detection of a D...
IRJET Journal
 
ScaRR
ScaRRScaRR
20090918 Agile Computer Control of a Complex Experiment
20090918 Agile Computer Control of a Complex Experiment20090918 Agile Computer Control of a Complex Experiment
20090918 Agile Computer Control of a Complex Experiment
Jonathan Blakes
 
Incorporation of IoT in Assembly Line Monitoring System
Incorporation of IoT in Assembly Line Monitoring SystemIncorporation of IoT in Assembly Line Monitoring System
Incorporation of IoT in Assembly Line Monitoring System
IRJET Journal
 
An open-source testbed for IoT systems
An open-source testbed for IoT systemsAn open-source testbed for IoT systems
An open-source testbed for IoT systems
Augusto Ciuffoletti
 
Cloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injectionCloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injection
Jorge Cardoso
 
IRJET- Design of Fault Injection Technique for Digital HDL Models
IRJET-  	  Design of Fault Injection Technique for Digital HDL ModelsIRJET-  	  Design of Fault Injection Technique for Digital HDL Models
IRJET- Design of Fault Injection Technique for Digital HDL Models
IRJET Journal
 
IRJET- A Defense System Against Application Layer Ddos Attacks with Data Secu...
IRJET- A Defense System Against Application Layer Ddos Attacks with Data Secu...IRJET- A Defense System Against Application Layer Ddos Attacks with Data Secu...
IRJET- A Defense System Against Application Layer Ddos Attacks with Data Secu...
IRJET Journal
 
In pursuit of architectural agility: experimenting with microservices
In pursuit of architectural agility: experimenting with microservicesIn pursuit of architectural agility: experimenting with microservices
In pursuit of architectural agility: experimenting with microservices
Alberto Simioni
 
IRJET- Analysis of Forensics Tools in Cloud Environment
IRJET-  	  Analysis of Forensics Tools in Cloud EnvironmentIRJET-  	  Analysis of Forensics Tools in Cloud Environment
IRJET- Analysis of Forensics Tools in Cloud Environment
IRJET Journal
 
IRJET- Object Detection using Machine Learning Technique
IRJET- Object Detection using Machine Learning TechniqueIRJET- Object Detection using Machine Learning Technique
IRJET- Object Detection using Machine Learning Technique
IRJET Journal
 
A SURVEY OF VIRTUAL PROTOTYPING TECHNIQUES FOR SYSTEM DEVELOPMENT AND VALIDATION
A SURVEY OF VIRTUAL PROTOTYPING TECHNIQUES FOR SYSTEM DEVELOPMENT AND VALIDATIONA SURVEY OF VIRTUAL PROTOTYPING TECHNIQUES FOR SYSTEM DEVELOPMENT AND VALIDATION
A SURVEY OF VIRTUAL PROTOTYPING TECHNIQUES FOR SYSTEM DEVELOPMENT AND VALIDATION
IJCSES Journal
 
IoT Workshop in Macao
IoT Workshop in MacaoIoT Workshop in Macao
IoT Workshop in Macao
Shigeru Kobayashi
 
IoT Workshop in Macao
IoT Workshop in MacaoIoT Workshop in Macao
IoT Workshop in Macao
Shigeru Kobayashi
 
Butler
ButlerButler
September Patch Tuesday Analysis 2018
September Patch Tuesday Analysis 2018September Patch Tuesday Analysis 2018
September Patch Tuesday Analysis 2018
Ivanti
 
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
Jorge Cardoso
 

Similar to Slide presentation of "How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform" (20)

Neutron Extension API
Neutron Extension APINeutron Extension API
Neutron Extension API
 
IRJET- Web Application Firewall: Artificial Intelligence ARC
IRJET-  	  Web Application Firewall: Artificial Intelligence ARCIRJET-  	  Web Application Firewall: Artificial Intelligence ARC
IRJET- Web Application Firewall: Artificial Intelligence ARC
 
Dependability Benchmarking by Injecting Software Bugs
Dependability Benchmarking by Injecting Software BugsDependability Benchmarking by Injecting Software Bugs
Dependability Benchmarking by Injecting Software Bugs
 
IRJET- Windows Log Investigator System for Faster Root Cause Detection of a D...
IRJET- Windows Log Investigator System for Faster Root Cause Detection of a D...IRJET- Windows Log Investigator System for Faster Root Cause Detection of a D...
IRJET- Windows Log Investigator System for Faster Root Cause Detection of a D...
 
ScaRR
ScaRRScaRR
ScaRR
 
20090918 Agile Computer Control of a Complex Experiment
20090918 Agile Computer Control of a Complex Experiment20090918 Agile Computer Control of a Complex Experiment
20090918 Agile Computer Control of a Complex Experiment
 
Incorporation of IoT in Assembly Line Monitoring System
Incorporation of IoT in Assembly Line Monitoring SystemIncorporation of IoT in Assembly Line Monitoring System
Incorporation of IoT in Assembly Line Monitoring System
 
An open-source testbed for IoT systems
An open-source testbed for IoT systemsAn open-source testbed for IoT systems
An open-source testbed for IoT systems
 
Cloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injectionCloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injection
 
IRJET- Design of Fault Injection Technique for Digital HDL Models
IRJET-  	  Design of Fault Injection Technique for Digital HDL ModelsIRJET-  	  Design of Fault Injection Technique for Digital HDL Models
IRJET- Design of Fault Injection Technique for Digital HDL Models
 
IRJET- A Defense System Against Application Layer Ddos Attacks with Data Secu...
IRJET- A Defense System Against Application Layer Ddos Attacks with Data Secu...IRJET- A Defense System Against Application Layer Ddos Attacks with Data Secu...
IRJET- A Defense System Against Application Layer Ddos Attacks with Data Secu...
 
In pursuit of architectural agility: experimenting with microservices
In pursuit of architectural agility: experimenting with microservicesIn pursuit of architectural agility: experimenting with microservices
In pursuit of architectural agility: experimenting with microservices
 
IRJET- Analysis of Forensics Tools in Cloud Environment
IRJET-  	  Analysis of Forensics Tools in Cloud EnvironmentIRJET-  	  Analysis of Forensics Tools in Cloud Environment
IRJET- Analysis of Forensics Tools in Cloud Environment
 
IRJET- Object Detection using Machine Learning Technique
IRJET- Object Detection using Machine Learning TechniqueIRJET- Object Detection using Machine Learning Technique
IRJET- Object Detection using Machine Learning Technique
 
A SURVEY OF VIRTUAL PROTOTYPING TECHNIQUES FOR SYSTEM DEVELOPMENT AND VALIDATION
A SURVEY OF VIRTUAL PROTOTYPING TECHNIQUES FOR SYSTEM DEVELOPMENT AND VALIDATIONA SURVEY OF VIRTUAL PROTOTYPING TECHNIQUES FOR SYSTEM DEVELOPMENT AND VALIDATION
A SURVEY OF VIRTUAL PROTOTYPING TECHNIQUES FOR SYSTEM DEVELOPMENT AND VALIDATION
 
IoT Workshop in Macao
IoT Workshop in MacaoIoT Workshop in Macao
IoT Workshop in Macao
 
IoT Workshop in Macao
IoT Workshop in MacaoIoT Workshop in Macao
IoT Workshop in Macao
 
Butler
ButlerButler
Butler
 
September Patch Tuesday Analysis 2018
September Patch Tuesday Analysis 2018September Patch Tuesday Analysis 2018
September Patch Tuesday Analysis 2018
 
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
 

Recently uploaded

Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
Kamal Acharya
 
Object Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOADObject Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOAD
PreethaV16
 
Northrop Grumman - Aerospace Structures Overvi.pdf
Northrop Grumman - Aerospace Structures Overvi.pdfNorthrop Grumman - Aerospace Structures Overvi.pdf
Northrop Grumman - Aerospace Structures Overvi.pdf
takipo7507
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
nedcocy
 
Literature review for prompt engineering of ChatGPT.pptx
Literature review for prompt engineering of ChatGPT.pptxLiterature review for prompt engineering of ChatGPT.pptx
Literature review for prompt engineering of ChatGPT.pptx
LokerXu2
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
VANDANAMOHANGOUDA
 
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
Particle Swarm Optimization–Long Short-Term Memory based Channel Estimation w...
IJCNCJournal
 
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call GirlCall Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
sapna sharmap11
 
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
Dr.Costas Sachpazis
 
comptia-security-sy0-701-exam-objectives-(5-0).pdf
comptia-security-sy0-701-exam-objectives-(5-0).pdfcomptia-security-sy0-701-exam-objectives-(5-0).pdf
comptia-security-sy0-701-exam-objectives-(5-0).pdf
foxlyon
 
An In-Depth Exploration of Natural Language Processing: Evolution, Applicatio...
An In-Depth Exploration of Natural Language Processing: Evolution, Applicatio...An In-Depth Exploration of Natural Language Processing: Evolution, Applicatio...
An In-Depth Exploration of Natural Language Processing: Evolution, Applicatio...
DharmaBanothu
 
Flow Through Pipe: the analysis of fluid flow within pipes
Flow Through Pipe:  the analysis of fluid flow within pipesFlow Through Pipe:  the analysis of fluid flow within pipes
Flow Through Pipe: the analysis of fluid flow within pipes
Indrajeet sahu
 
openshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoinopenshift technical overview - Flow of openshift containerisatoin
openshift technical overview - Flow of openshift containerisatoin
snaprevwdev
 
Asymmetrical Repulsion Magnet Motor Ratio 6-7.pdf
Asymmetrical Repulsion Magnet Motor Ratio 6-7.pdfAsymmetrical Repulsion Magnet Motor Ratio 6-7.pdf
Asymmetrical Repulsion Magnet Motor Ratio 6-7.pdf
felixwold
 
一比一原版(psu学位证书)美国匹兹堡州立大学毕业证如何办理
一比一原版(psu学位证书)美国匹兹堡州立大学毕业证如何办理一比一原版(psu学位证书)美国匹兹堡州立大学毕业证如何办理
一比一原版(psu学位证书)美国匹兹堡州立大学毕业证如何办理
nonods
 
This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...
DharmaBanothu
 
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptxSENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
b0754201
 
OOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming languageOOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming language
PreethaV16
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
upoux
 
Impartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 StandardImpartiality as per ISO /IEC 17025:2017 Standard
Impartiality as per ISO /IEC 17025:2017 Standard
MuhammadJazib15
 

Recently uploaded (20)

Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
 
Object Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOADObject Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOAD
 
Northrop Grumman - Aerospace Structures Overvi.pdf
Northrop Grumman - Aerospace Structures Overvi.pdfNorthrop Grumman - Aerospace Structures Overvi.pdf
Northrop Grumman - Aerospace Structures Overvi.pdf
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
 

Slide presentation of "How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform"

  • 1. How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform
    Domenico Cotroneo*, Luigi De Simone*, Pietro Liguori*, Roberto Natella*, Nematollah Bidokhti**
    *DIETI, Università degli Studi di Napoli Federico II, Italy
    **Futurewei Technologies, Inc., USA
    *{cotroneo, luigi.desimone, pietro.liguori, roberto.natella}@unina.it **nbidokht@futurewei.com
    ESEC/FSE 2019, Tallinn, Estonia, 26-30 August, 2019
  • 2. Problem: The fragility of cloud computing infrastructure software
    Gunawi et al., 2016. “Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages”. In Proc. SoCC
  • 3. Our case study: OpenStack
    Services: Nova, Horizon, Cinder, Neutron, Glance, Keystone, Swift
    Example: instance creation request
    1. Failure notified by a timely API error (Fail-stop)
    2. Log messages with CRITICAL or ERROR severity:
       2019-08-27 15:13:20.106 ERROR nova.api.openstack.extensions Unexpected exception in API method …
    3. Failure is isolated
  • 4. Contribution
    - Empirical analysis of high-severity failures in the OpenStack cloud computing platform:
      RQ1: Are failures actually “fail-stop”?
      RQ2: Are failures logged?
      RQ3: Are failures propagated across sub-systems?
    - Artifacts for reproducing our experimental environment in a virtual machine:
      DOI: 10.6084/m9.figshare.8242877
  • 5. Fault Injection Methodology
    Injected bug (example):
        # ~/nova/compute/api.py
        # ORIGINAL CODE
        # self.compute_task_api.schedule_and_build_instances(instanceID, build_parameters)
        # BUGGY CODE (missing parameter)
        self.compute_task_api.schedule_and_build_instances(instanceID)
    Workload
    Workload Logs:
    - API Errors
      - openstack instance create
    - Assertion (Healthy) Checks
      - Network Status: Active
      - Instance Status: Error
    OpenStack sub-systems Logs:
        2019-08-27 15:13:20.106 ERROR nova.api.openstack.extensions Unexpected exception in API method …
  • 6. Overview of a fault injection experiment
    Original Python code:
        iface_name = self.get_interface_name(network, port)
    Injected Python code:
        if bug_trigger:
            # BUGGY CODE (FAULTY ROUND)
            # Missing Parameter (MP)
            iface_name = self.get_interface_name(network)
        else:
            # CORRECT CODE (FAULT-FREE ROUND)
            iface_name = self.get_interface_name(network, port)
    Timeline: Faulty round (ON), Fault-free round (OFF); Clean-up
  • 7. Which bugs should we inject?
    We went through problem reports on Launchpad to identify recurring bug-fixing changes in OpenStack.
    [Bar chart: number of bug fixes per fault type (API, DICT, SQL, RPC, SYSTEM, AGENT/PLUGIN)]
    Example bug-fixing change:
        --- nova/virt/libvirt_conn.py 2011-01-25 12:44:26 +0000
        +++ nova/virt/libvirt_conn.py 2011-01-25 20:42:26 +0000
        @@ -1268,13 +1268,13 @@
         if(ip_version == 4):
         # Allow DHCP responses
         dhcp_server = self._dhcp_server_for_instance(instance)
        - our_rules += ['-A %s -s %s -p udp --sport 67 --dport 68' %
        - (chain_name, dhcp_server)]
        + our_rules += ['-A %s -s %s -p udp --sport 67 --dport 68 '
        + '-j ACCEPT ' % (chain_name, dhcp_server)]
         elif(ip_version == 6):
    Fault type by sub-system:
        Fault type   Nova   Cinder   Neutron   ALL
        MFC           110       55        36   201
        WPV            60       40        36   136
        MP             57       38        36   131
        WRV           149       96        59   304
        TE             63       40        36   139
        ALL           439      269       203   911
  • 8. Fail-stop Behavior
    Workload (timeline): Add Role, Create Keypair, Create Security Group, Create Router, Create Network, Create Instance, Create Floating IP, Create Volume, Reboot instance, Create Image, Create Domain, Create Project, Create User, Create Subnetwork, Set Gateway, Add Floating IP to Instance, Attach Volume to Instance, Cleanup Resources
    API Error: openstack instance create
    When an API call generates an error, the workload is aborted.
    Assertion checks on the status of the virtual resources, e.g. Network Status: Active
  • 9. Non Fail-stop Behavior
    Workload (timeline): Add Role, Create Keypair, Create Security Group, Create Router, Create Network, Create Instance, Create Floating IP, Create Volume, Reboot instance, Create Image, Create Domain, Create Project, Create User, Create Subnetwork, Set Gateway, Add Floating IP to Instance, Attach Volume to Instance, Cleanup Resources
    Assertion check: Instance Status: Error. No API Error!
    Failure delay
    API Error: Cannot 'attach_volume' instanceID while it is in vm_state error
    The workload continues the execution regardless of the assertion check(s).
  • 10. RQ1: Does OpenStack Show a Fail-Stop Behavior?
    [Bar chart: percentage of experiments per failure type, for Nova, Cinder, Neutron, and all sub-systems]
    Failure types:
    - API Error Only (Fail-Stop): failures notified by a timely API error
    - Assertion Failure(s) & API Error (Non Fail-Stop): failures that were notified with a delay
    - Assertion Failure(s) Only (Non Fail-Stop): failures with no API error (but virtual resources are in incorrect state)
    Data labels: 40%, 37%, 23%, 35%, 46%, 18%, 60%, 32%, 7%, 44%, 38%, 18%
  • 11. RQ1: Does OpenStack Show a Fail-Stop Behavior?
    Median latency of the API error, by sub-system:
        Failure type                                                       Subsystem   Median Latency [s]
        Assertion Failure(s) followed by API Error (Non Fail-stop)         Nova                    152.25
                                                                           Cinder                   74.52
                                                                           Neutron                 144.72
        API Error Only (Fail-stop)                                         Nova                      3.73
                                                                           Cinder                    0.30
                                                                           Neutron                   0.30
    Long API error latency (2 minutes on average)
    [Plot: probability vs. time (s), 0 to 400 s, for Nova, Neutron, and Cinder]
  • 12. RQ2: Is OpenStack Able to Log Failures?
    - In 8.5% of experiments, no log messages with CRITICAL or ERROR severity
    Logging coverage:
        Subsystem   API Errors Only   Assertion Failure(s) and API Errors   Assertion Failure(s) Only
        Nova                 90.32%                                82.56%                      80.77%
        Cinder                 100%                                  100%                      95.65%
        Neutron              98.67%                                   95%                      66.67%
  • 13. RQ3: Do Failures Propagate Across OpenStack?
    Faulty Round
    [Propagation graph: injections in Nova, Cinder, and Neutron, linked to API errors (Nova API Error, Cinder API Error, Neutron API Error) and assertion failures (Failure SSH, Failure Instance Active, Failure Volume Attached, Failure Volume Created)]
    The failures propagate across OpenStack services in a significant number of cases (37.5% of the failures).
  • 14. RQ3: Do Failures Propagate Across OpenStack?
    Fault-Free Round after fault removal
    [Propagation graph: injections in Nova, Cinder, and Neutron, linked to API errors (Nova API Error, Cinder API Error, Neutron API Error) and assertion failures (Failure SSH, Failure Instance Active, Failure Volume Attached, Failure Volume Created)]
    Persistent Failures: even after we disable the fault (fault-free round), OpenStack still experiences failures (7.5% of the cases).
  • 15. Conclusion (Answers) (1/2)
    - RQ1: Are failures actually “fail-stop”?
      - Answer: In the majority of the cases, OpenStack does not behave in a “fail-stop” way (late or no API error)
      - Suggestions: Mitigate failures by actively checking the status of virtual resources as in our assertion checks (e.g., checks incorporated in a monitoring solution)
    - RQ2: Are failures logged?
      - Answer: In a small fraction of the experiments, there was no indication of the failure in the logs
      - Suggestions: Improve logging in the source code (e.g., by checking for errors returned by the faulty function calls)
  • 16. Conclusion (Answers) (2/2)
    - RQ3: Are failures propagated across sub-systems?
      - Answer: In most of the failures, the injected bugs propagated across several OpenStack sub-systems. There were also relevant cases of failures that caused subtle residual effects on OpenStack
      - Suggestions: Improve resource clean-up on errors, to prevent propagation across service API calls and across subsystems.
    Use our artifact to support future research on mitigating the impact of software bugs (DOI: 10.6084/m9.figshare.8242877)
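
The failure types used for RQ1 and the logging check used for RQ2 can be illustrated with a short Python sketch. It is not the authors' tooling: the `RoundResult` record, its field names, and the two helper functions are hypothetical, and the severity match is a simplifying assumption based on the log line shown in the slides. Because the workload is aborted at the first API error, a round that has both an API error and failed assertion checks is treated here as "assertion failure(s) followed by API error".

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class RoundResult:
    """Hypothetical summary of one faulty round (names are illustrative)."""
    api_error: bool                                   # did an API call return an error?
    assertion_failures: List[str] = field(default_factory=list)  # failed status checks
    log_lines: List[str] = field(default_factory=list)           # sub-system log lines


def classify(result: RoundResult) -> str:
    """Map a round to one of the failure types named in the slides."""
    if result.api_error and not result.assertion_failures:
        return "API Error Only"                       # fail-stop
    if result.api_error and result.assertion_failures:
        return "Assertion Failure(s) & API Error"     # non fail-stop, notified late
    if result.assertion_failures:
        return "Assertion Failure(s) Only"            # non fail-stop, never notified
    return "No Failure"


# Assumption: severity appears as an upper-case token, as in the slide's log line.
SEVERITY = re.compile(r"\b(CRITICAL|ERROR)\b")


def is_logged(result: RoundResult) -> bool:
    """True if any log line carries CRITICAL or ERROR severity."""
    return any(SEVERITY.search(line) for line in result.log_lines)


if __name__ == "__main__":
    silent = RoundResult(
        api_error=False,
        assertion_failures=["Instance Status: Error"],
        log_lines=["2019-08-27 15:13:20.106 ERROR nova.api.openstack.extensions "
                   "Unexpected exception in API method"],
    )
    print(classify(silent), "| logged:", is_logged(silent))
```

A monitoring check along the lines suggested for RQ1 would run the same kind of status assertions periodically instead of waiting for an API error.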