identification and exercise of
fault injection campaigns for
experimental dependability assessments
of distributed systems
exemplified by a case study on OpenStack
Lukas Pirl
supervision: Lena Feinbube, Prof. Dr. Andreas Polze
Operating Systems and Middleware Group
Hasso Plattner Institute for Digital Engineering
University of Potsdam
June 2017
about this presentation
high-level to deliver concepts
few technical details about non-contributions
e.g., no listings of software used
details upon request, of course
Lukas Pirl | Identification & Exercise of Fault Injection Campaigns for Distributed Systems / Case Study OpenStack | June 2017 | Hasso Plattner Institute Potsdam | 2
introduction
motivation
software systems are increasingly …
… critical
failures can cause economic, ecologic, or even personal damage
… complex
higher probability of, e.g., design flaws, bugs, interferences
… distributed
notoriously hard
introduction
dependability
dependability is a measure of quality
allows comparisons
and thus optimization
dependability := the ability to deliver
service that can justifiably be trusted ¹
note: this does not imply, e.g., 100% availability: “service” depends on specification
1: A. Avižienis, J.-C. Laprie, and B. Randell, “Dependability and Its Threats: A Taxonomy,” in Building the Information Society, Springer, Boston, MA, 2004, pp. 91–120.
introduction
dependability
concerns whole society
individuals
e.g., banking, communication
enterprises
e.g., resource planning, plant control
politics
e.g., electronic voting, taxing
introduction
dependability
concerns all deployment models
single software systems
e.g., in medical applications, such as pacemakers
distributed systems
e.g., in traffic, such as train protection
Cloud (i.e., IaaS)
e.g., in research, such as genome analysis
services (i.e., SaaS)
e.g., in information management, such as data storage
introduction
war stories
fault tolerant distributed systems do fail¹
2.5h Facebook outage 2010
“friendly” DDoS due to wrong configuration value
8h Azure outage 2012
leap day bug in SSL certificate generation
4.5h Amazon S3 outage 2017
typo in manual command took “too many” servers down
see this link for visualizations and more examples
1: D. Oppenheimer et al., “Why do Internet services fail, and what can be done about it?,” in USENIX Symposium on Internet Technologies and Systems, 2003, vol. 67.
introduction
fault injection
fault injection
⊂ testing
experimental dependability assessment
with relatively low complexity
concept:
1. forcefully introduce faults
fault := suspected error cause
2. assess the delivered quality of service
long-established for testing hardware
introduction
software fault injection
software fault injection
implemented in software and targeting software
!= HWIFI: hardware-implemented & targeting hardware
!= SWIFI: software-implemented & targeting hardware
not yet widely adopted for testing software
missing accessibility?
tools too specialized?
available information too scattered/heterogeneous/theoretical?
missing automation?
e.g., in comparison to unit testing
introduction
fundamental questions for assessments
very specific to the system under test
What to inject? ➜ fault model
fault classes, e.g., crash faults
When to inject? ➜ trigger mechanism
when injecting during runtime, likely chosen according to workload
Where to inject? ➜ dependability model
at spots where faults should be tolerated
introduction
software fault injection – success stories
fault injection in production
Etsy ¹
Netflix
Chaos Monkey
terminates Amazon EC2 instances
in Auto Scaling Groups
during business hours only
staff is watching and can react quickly
1: J. Allspaw, “Fault injection in production,” Communications of the ACM, vol. 55, no. 10, pp. 48–52, 2012.
D. Avresky, J. Arlat, J.-C. Laprie, and Y. Crouzet, “Fault injection for formal testing of fault tolerance,” IEEE Transactions on Reliability, vol. 45, no. 3, pp. 443–455, 1996.
introduction
related work
programmatic identification of the fault load
extracted from code
would testing be self-fulfilling?
extracted from formal description
laborious, complicated
➜ little incorporation of users’ expectations
user := person assessing the dependability of a service
e.g., developers, architects
introduction
related work
tools for injecting faults
tend to be highly specialized
e.g., programming language, execution environments, APIs
➜ hard to adapt
tend to employ a fine granularity
and with a lot of components in many layers of abstraction?
➜ inappropriate for complex systems
too many injection points / too complex
research question
How can we make software fault injection
for complex distributed systems
more easily accessible and integratable
in software development practices?
approach
model-based
capture users’ expectations in a comprehensible format
full automation
campaign derivation and exercise
campaign := all fault injection experiments for specific assessment
case study
on OpenStack
recent, complex, distributed, promises high-availability
[figure: approach overview with the elements model, derivation, campaign, exercise, system under test, workload, and measurements; derivation and exercise are automated]
dependability modeling
models
to capture users’ expectations
comprehensible by non-developers
commonly used to cope with complexity
usually formal
machine-readable
fosters representativeness
because it incorporates users’ experience
e.g., structure of the system under test, granularity, fault types
dependability modeling
fault tree diagrams
for root cause analyses
top-down, deductive
used up-front or for accident investigation
Boolean logic
extensions for timing, dependencies, etc.
fault tree diagrams
online editor available
dependability modeling
granularity
Where to inject?
system under test
e.g., source code
interfaces of system under test
e.g., APIs
interfaces built on
e.g., other services, host OS
hardware
e.g., network, data, performance, clock drift
➜ need clear scope to cope with complexity
for this assessment: nodes are black boxes
node := physical or virtual machines the system is composed of
i.e., nodes seen as “atoms” of system
➜ nodes are targets for fault injection
OpenStack
case study
assessing the dependability of OpenStack
OpenStack
framework for building IaaS platforms
providing, e.g., VMs, storage, networking to its users
free, open-source software
Apache License 2.0
emerged from projects at Rackspace and NASA
i.e., from Cloud Files and Nebula
composed of services
communication to/between them mostly via RESTful APIs over HTTP
OpenStack
services
dashboard: Horizon
auth*: Keystone
virtual machine image: Glance
object: Swift
block: Cinder
network: Neutron
compute: Nova
[figure: relations between the services, labeled: provides images, might use, provides block storage, provides network, manages, stores backups, stores images, provides UI, provides auth, VNC access]
OpenStack Fuel
OpenStack distribution
UIs to manage OpenStack instances
capable of installing a high availability setup
high availability :≈ as commonly used
(i.e., higher dependability than common comparable systems)
OpenStack Fuel
concept
[figure: Fuel master with configured OpenStack nodes, an OpenStack node getting configured, and unconfigured nodes; the master sends OS images via PXE for bootstrapping]
OpenStack Fuel
distributes services according to node roles
OpenStack Fuel
node roles
[figure: services diagram (Horizon, Keystone, Glance, Cinder, Neutron, Nova, Swift) grouped into the node roles storage, controller, and compute]
OpenStack Fuel
nodes and networks
[figure: nodes (3 × controller, 3 × storage, 2 × compute, master) and networks (mgmt, storage, public, private, PXE, internet)]
OpenStack Fuel
deployment attempt 1
1 × Intel Xeon E3-1284L v4, 4C/8T, 2.90GHz
32GB, 4 × 8GB PC3L-12800 (DDR3–1600) SODIMM
120GB HP 765479-B21 SSD (M.2 2280)
Ubuntu 16.04, Linux 4.4.0
main insights:
A. unusably slow
B. unstable
a. too little RAM
OpenStack Fuel
deployment attempt 2
2 × Intel Xeon E5-2630, 6C/12T, 2.30GHz
384GB, 24 × 16GB PC3-12800 (DDR3–1600) DIMM
1TB RAID-1, 2 × 1TB 7,200 RPM, 64MB Cache
Debian 8.6, Linux 4.7.0
main insights:
A. very hard (but possible) to get sufficient performance (esp. IO) with nested virtualization
B. snapshot and restore of running OpenStack via outer VM does not work
OpenStack Fuel
deployment attempt 2: virtualization IO settings
OpenStack Fuel
deployment attempt 3
2 × Intel Xeon E5-2630, 6C/12T, 2.30GHz
384GB, 24 × 16GB PC3-12800 (DDR3–1600) DIMM
1TB RAID-1, 2 × 1TB 7,200 RPM, 64MB Cache
Debian 8.6, Linux 4.7.0
main insights:
A. usable performance
B. snapshot and restore of running OpenStack via inner VMs does not work
a. VMs need shutdown and reboot
Unfortunately, the physical host became unavailable during the assessment.
OpenStack Fuel
deployment used for assessments
hardware of each physical host:
1 × Intel Xeon E3-1284L v4, 4C/8T, 2.90GHz
32GB, 4 × 8GB PC3L-12800 (DDR3–1600) SODIMM
120GB HP 765479-B21 SSD (M.2 2280)
Mellanox ConnectX-3 Pro (dual 10GbE, one used)
Ubuntu 16.04, Linux 4.4.0
virtualized
easy programmatic machine crashes, freezes, etc.
virtualized hardware
i.e., full virtualization
enables programmatic injection in virtual hardware
virtualized networks
easy programmatic injection in network traffic
9 nodes distributed over 5 physical hosts
after a lot of experimentation with hardware & virtualization
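The "easy programmatic machine crashes, freezes, etc." can be sketched as follows. This is only an illustration under the assumption that the nodes are libvirt domains controlled with the `virsh` CLI. The slides do not name the virtualization tooling, and the domain name `controller-1` and the fault names are hypothetical.

```python
import subprocess

# Assumption: nodes are libvirt domains; the slides do not state the tooling.
# virsh destroy = immediate, ungraceful power-off (crash-like),
# virsh suspend = pause the VM (freeze-like), resume/start undo these.
VIRSH_ACTIONS = {
    "crash": "destroy",
    "freeze": "suspend",
    "unfreeze": "resume",
    "boot": "start",
}


def fault_command(fault, domain):
    """Build the virsh command line for a fault on a (hypothetical) domain."""
    return ["virsh", VIRSH_ACTIONS[fault], domain]


def inject(fault, domain, dry_run=True):
    """Return the command and, unless dry_run is set, execute it."""
    command = fault_command(fault, domain)
    if not dry_run:
        subprocess.run(command, check=True)
    return command


# Dry run: only shows what would be executed.
print(inject("crash", "controller-1"))
```

With `dry_run=True` nothing is executed, so the sketch can be tried without a hypervisor.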
example workload
Tahoe-LAFS
Least-Authority File Store
distributed file store
free, open-source, peer-to-peer
uses three main IaaS components
storage, compute, network
automatable through APIs
performance counters for analysis
example workload
Tahoe-LAFS grid
[figure: Tahoe-LAFS grid consisting of an introducer and several nodes; clients access a trusted node via CLI, sftp, REST, …]
example workload
storage stack
fault tree modeling
OpenStack Fuel
controller HA
https://docs.mirantis.com/openstack/fuel/fuel-8.0/mos-planning-guide.html#fuel-reference-architecture-overview
HAProxy
load balancing & failover
MySQL cluster
active/active
RabbitMQ cluster
active/active
pacemaker
cluster manager
e.g., trigger MySQL rebuild after network partitioning
corosync
manages quorum
e.g., stop services if controller not in quorum
fault tree modeling
alternative input: fault tree in dot format
campaign derivation
derive campaign from dependability model
representative
i.e., injected faults are realistic & relevant
full coverage
i.e., test all modeled fault tolerance mechanisms
repeatable
i.e., ability to repeat exercising a campaign
efficient
i.e., a reasonable cost-benefit ratio
campaign derivation
dependability stress
F: all fault injection points
≙ all basic events
S ⊆ F: scenario
≙ a set of faults to inject
A := ℘(F): all possible scenarios
≙ all possible subsets of F
C ⊆ A: campaign
≙ sets of faults to inject
S ∊ C ⇔ ?
campaign derivation
dependability stress
exercising all scenarios is not feasible
|℘(F)| = 2^|F|
➜ exclude certain scenarios
scenarios which are expected to fail
re-using well-known algorithm for fault trees (MOCUS)
scenarios which are included in others
e.g., exclude {crash A} if there is {crash A, crash B}
i.e., “maximization” of scenarios
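The two exclusion steps above can be illustrated with a small sketch. It enumerates ℘(F) by brute force instead of using MOCUS, and both the fault injection points and the failure condition are a made-up toy fault tree, not the model from the case study.

```python
from itertools import combinations

# Hypothetical fault injection points (basic events), for illustration only.
F = ["crash controller-1", "crash controller-2", "crash controller-3",
     "crash storage-1", "crash storage-2"]


def expected_to_fail(scenario):
    """Toy fault tree (assumption): the service fails if a majority of the
    controllers (2 of 3) or all storage nodes (2 of 2) are crashed."""
    controllers_down = sum(1 for fault in scenario if "controller" in fault)
    storage_down = sum(1 for fault in scenario if "storage" in fault)
    return controllers_down >= 2 or storage_down >= 2


def all_scenarios(faults):
    """Yield every subset of the fault injection points, i.e., ℘(F)."""
    for size in range(len(faults) + 1):
        for subset in combinations(faults, size):
            yield frozenset(subset)


def derive_campaign(faults):
    """Drop scenarios expected to fail, then drop scenarios that are a
    proper subset of another remaining scenario (keep maximal ones)."""
    tolerable = [s for s in all_scenarios(faults) if not expected_to_fail(s)]
    return [s for s in tolerable if not any(s < other for other in tolerable)]


campaign = derive_campaign(F)
print(len(list(all_scenarios(F))), "scenarios in total")
print(len(campaign), "maximal scenarios in the campaign")
for scenario in campaign:
    print(sorted(scenario))
```

For this toy model, 32 scenarios shrink to 6: each remaining scenario crashes exactly one controller and one storage node at the same time. Brute-force enumeration only works for small F, which is presumably why the slides re-use an established fault tree algorithm instead.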
[figure: all |℘(F)| = 2^|F| scenarios, divided into expect-failure scenarios and expect-success scenarios; within the expect-success scenarios, the non-maximal scenarios are excluded and the remaining ones form the campaign]
campaign derivation
dependability stress
maximize “dependability stress”
i.e., inject as many faults simultaneously as tolerable
tests for synergistic failures
i.e., tests for interferences between fault tolerance mechanisms
since fault tolerance mechanisms are active simultaneously
increases efficiency
scenarios are “merged”
repeatable
derivation is deterministic
case study
campaign derivation
all scenarios: 2048
would take ~333h to exercise
maximal scenarios: 36
takes ~6h to exercise
…with 3 fault types & 10 runs per scenario:
all scenarios: 61440 runs, ~416d to exercise
maximal scenarios: 1080 runs, ~7.3d to exercise
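The numbers above are consistent with each other, as the following back-of-the-envelope check suggests. The time per run is not stated on the slide. It is derived here from "2048 scenarios take ~333h", assuming every run takes equally long.

```python
# Figures taken from the slide.
HOURS_ALL = 333          # ~333h for all scenarios
ALL_SCENARIOS = 2048     # = 2**11, i.e., 11 fault injection points
MAXIMAL_SCENARIOS = 36
FAULT_TYPES = 3
RUNS_PER_SCENARIO = 10

# Assumption: every run takes the same time (derived, not stated).
hours_per_run = HOURS_ALL / ALL_SCENARIOS


def exercise_time(scenarios, fault_types=1, runs=1):
    """Return (number of runs, total hours) for a campaign."""
    total_runs = scenarios * fault_types * runs
    return total_runs, total_runs * hours_per_run


print(f"time per run: ~{hours_per_run * 60:.1f} min")

runs, hours = exercise_time(MAXIMAL_SCENARIOS)
print(f"maximal scenarios: {runs} runs, ~{hours:.1f}h")

runs, hours = exercise_time(ALL_SCENARIOS, FAULT_TYPES, RUNS_PER_SCENARIO)
print(f"all scenarios, 3 fault types, 10 runs: {runs} runs, ~{hours / 24:.0f}d")

runs, hours = exercise_time(MAXIMAL_SCENARIOS, FAULT_TYPES, RUNS_PER_SCENARIO)
print(f"maximal scenarios, 3 fault types, 10 runs: {runs} runs, ~{hours / 24:.1f}d")
```

Under this assumption a single run takes a little under ten minutes, and restricting the campaign to maximal scenarios cuts the exercise time by a factor of 2048 / 36 ≈ 57.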
exercise of the campaign
via user-provided implementation
executables in a specific directory structure
get state as CLI parameters
any language of preference
reuse existing tools
low level of integration
easily exchangeable
well-known pattern from UNIX-like OS
e.g., “hooks”, run-parts utility
exercise of the campaign
directories for executables
1 ./pre-campaign
e.g., take snapshot of system, prepare logging
2-1 ./pre-scenario
e.g., restore snapshot, measure performance
2-2 ./event
invoked per fault to inject, e.g., power off a VM
2-3 ./post-scenario
e.g., measure performance
3 ./post-campaign
e.g., restore snapshot, trigger data analysis
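The order in which the directories are visited can be sketched with a small runner in the style of run-parts. This is not the implementation presented here: the argument convention (scenario number, then fault name) and the hook file name `10-echo` are assumptions, and the demo expects a POSIX system with `/bin/sh`.

```python
import os
import stat
import subprocess
import tempfile
from pathlib import Path


def run_hooks(directory, *args):
    """Run all executables in `directory` in lexical order (like run-parts)
    and return their stripped standard output, one entry per executable."""
    outputs = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and os.access(path, os.X_OK):
            result = subprocess.run([str(path), *args], check=True,
                                    capture_output=True, text=True)
            outputs.append(result.stdout.strip())
    return outputs


def exercise(base, campaign):
    """Exercise a campaign: a list of scenarios, each a list of faults.
    Assumption: hooks get the scenario number and fault as CLI parameters."""
    base = Path(base)
    log = run_hooks(base / "pre-campaign")
    for number, scenario in enumerate(campaign):
        log += run_hooks(base / "pre-scenario", str(number))
        for fault in scenario:
            log += run_hooks(base / "event", str(number), fault)
        log += run_hooks(base / "post-scenario", str(number))
    log += run_hooks(base / "post-campaign")
    return log


# Demo with placeholder hooks that only echo their directory and arguments.
with tempfile.TemporaryDirectory() as base:
    for name in ["pre-campaign", "pre-scenario", "event",
                 "post-scenario", "post-campaign"]:
        hook = Path(base, name, "10-echo")
        hook.parent.mkdir()
        hook.write_text(f'#!/bin/sh\necho "{name} $*"\n')
        hook.chmod(hook.stat().st_mode | stat.S_IXUSR)
    log = exercise(base, [["crash node-1", "crash node-2"]])

print("\n".join(log))
```

Because the hooks are plain executables, each one could be replaced by a script in any language, which matches the "low level of integration" argument above.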
exercise of the campaign
workers
results
measurements
perspective of OpenStack users
one specific use case
upload 100MB of random binary data to Tahoe-LAFS grid
before, during, after faults are active
collect performance counters from Tahoe-LAFS
analysis based on performance degradation
“how many times it took longer in the presence of faults”
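The degradation metric can be written down as a ratio of upload durations. The sketch below is an assumption about the details: the slides do not say how repeated runs are aggregated (the median is used here), and all timings are made-up example values.

```python
from statistics import median


def degradation(baseline_seconds, faulty_seconds):
    """How many times longer the upload took in the presence of faults.
    Assumption: repeated runs are aggregated with the median."""
    return median(faulty_seconds) / median(baseline_seconds)


# Made-up upload durations in seconds (not measurements from the case study).
baseline = [41.0, 39.5, 40.2]
during_faults = [118.9, 121.3, 120.6]

factor = degradation(baseline, during_faults)
print(f"upload took {factor:.1f} times as long while faults were active")
```

A factor close to 1 means the faults were tolerated without noticeable slowdown. Comparing the factor against the fluctuation of repeated baseline runs helps to tell real degradation from noise.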
results
degradation after activating stop faults
results
degradation after deactivating stop faults
conclusion
conclusion
challenges
campaign derivation
with non-Boolean elements in the dependability model
complex distributed systems
getting a fully virtualized OpenStack setup
with performant nested virtualization
restoring from snapshots
after restore: boot all nodes & wait for OpenStack to be operable
automation
absent APIs, long run times (hard to debug), sporadic failures
conclusion
contributions
[previous slide] +
framework for software fault injection
model-based, automated, repeatable, flexible
dependability assessment of OpenStack Fuel
reusable implementation for exercising the campaign
baseline performance fluctuation
two fault types
crash, stop
? !
Lukas Pirl | slideshare+sfi-os@lukas-pirl.de | http://lukas-pirl.de
Study of Software Defect Prediction using Forward Pass RNN with Hyperbolic Ta...Study of Software Defect Prediction using Forward Pass RNN with Hyperbolic Ta...
Study of Software Defect Prediction using Forward Pass RNN with Hyperbolic Ta...
 
Software Sustainability Institute
Software Sustainability InstituteSoftware Sustainability Institute
Software Sustainability Institute
 
Risk of Adopting Open Source ERP for Small Manufacturers: A Case Study
Risk of Adopting Open Source ERP for Small Manufacturers: A Case StudyRisk of Adopting Open Source ERP for Small Manufacturers: A Case Study
Risk of Adopting Open Source ERP for Small Manufacturers: A Case Study
 
IIA4: Open Source and the Enterprise ( Predix Transform 2016)
IIA4: Open Source and the Enterprise ( Predix Transform 2016)IIA4: Open Source and the Enterprise ( Predix Transform 2016)
IIA4: Open Source and the Enterprise ( Predix Transform 2016)
 
Review on Algorithmic and Non Algorithmic Software Cost Estimation Techniques
Review on Algorithmic and Non Algorithmic Software Cost Estimation TechniquesReview on Algorithmic and Non Algorithmic Software Cost Estimation Techniques
Review on Algorithmic and Non Algorithmic Software Cost Estimation Techniques
 

identification and exercise of fault injection campaigns for experimental dependability assessment of distributed systems exemplified by a case study on open stack

  • 1. identification and exercise of fault injection campaigns for experimental dependability assessments of distributed systems, exemplified by a case study on OpenStack. Lukas Pirl; supervision: Lena Feinbube, Prof. Dr. Andreas Polze; Operating Systems and Middleware Group, Hasso Plattner Institute for Digital Engineering, University of Potsdam; June 2017
  • 2. about this presentation: high-level to deliver concepts; few technical details about non-contributions (e.g., no listings of software used); details upon request, of course. Lukas Pirl | Identification & Exercise of Fault Injection Campaigns for Distributed Systems / Case Study OpenStack | June 2017 | Hasso Plattner Institute Potsdam | 2
  • 3. introduction, motivation: software systems are increasingly critical (failures can cause economic, ecological, or even personal damage), complex (higher probability of, e.g., design flaws, bugs, interferences), and distributed (notoriously hard).
  • 4. introduction, dependability: dependability is a measure of quality; it allows comparisons and thus optimization. dependability := the ability to deliver service that can justifiably be trusted ¹. note: this does not imply, e.g., 100% availability; what counts as “service” depends on the specification. 1: A. Avižienis, J.-C. Laprie, and B. Randell, “Dependability and Its Threats: A Taxonomy,” in Building the Information Society, Springer, Boston, MA, 2004, pp. 91–120.
  • 5. introduction, dependability concerns the whole society: individuals (e.g., banking, communication); enterprises (e.g., resource planning, plant control); politics (e.g., electronic voting, taxing).
  • 6. introduction, dependability concerns all deployment models: single software systems (e.g., in medical applications, such as pacemakers); distributed systems (e.g., in traffic, such as train protection); Cloud, i.e., IaaS (e.g., in research, such as genome analysis); services, i.e., SaaS (e.g., in information management, such as data storage).
  • 7. introduction, war stories: fault tolerant distributed systems do fail ¹. 2.5h Facebook outage, 2010: “friendly” DDoS due to wrong configuration value. 8h Azure outage, 2012: leap day bug in SSL certificate generation. 4.5h Amazon S3 outage, 2017: typo in manual command took “too many” servers down. see this link for visualizations and more examples. 1: D. Oppenheimer et al., “Why do Internet services fail, and what can be done about it?,” in USENIX Symposium on Internet Technologies and Systems, 2003, vol. 67.
  • 8. introduction, fault injection: fault injection ⊂ testing; experimental dependability assessment with relatively low complexity. concept: 1. forcefully introduce faults (fault := suspected error cause); 2. assess the delivered quality of service. long-established for testing hardware. [diagram: fault injection as a subset of testing]
  • 10. introduction, software fault injection: software fault injection := implemented in software and targeting software; != HWIFI (hardware-implemented & targeting hardware); != SWIFI (software-implemented & targeting hardware). not yet widely adopted for testing software. Missing accessibility? tools too specialized? available information too scattered/heterogeneous/theoretical? Missing automation? e.g., in comparison to unit testing
  • 11. introduction, fundamental questions for assessments (very specific to the system under test): What to inject? ➜ fault model (fault classes, e.g., crash faults). When to inject? ➜ trigger mechanism (when injecting during runtime, likely chosen according to workload). Where to inject? ➜ dependability model (at spots where faults should be tolerated).
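The three questions above can be read as the fields of a single injection record. The following is a minimal sketch under that reading; the class names, the `FREEZE` member, the time-offset trigger, and the node name are hypothetical illustrations and not taken from the slides.

```python
from dataclasses import dataclass
from enum import Enum


class FaultClass(Enum):
    """What to inject (fault model). Only crash faults are named on this slide."""
    CRASH = "crash"
    FREEZE = "freeze"  # hypothetical additional fault class


@dataclass(frozen=True)
class Injection:
    what: FaultClass      # fault model: which fault class to inject
    when_seconds: float   # trigger: assumed here to be an offset into the workload run
    where: str            # injection point taken from the dependability model


# Hypothetical example: crash one controller node 30 s into the workload.
example = Injection(FaultClass.CRASH, when_seconds=30.0, where="controller-1")
```

Keeping the record immutable (`frozen=True`) is a design choice that makes scenarios hashable, so they can be collected in sets.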
  • 12. introduction, software fault injection success stories: fault injection in production at Etsy ¹; Netflix Chaos Monkey terminates Amazon EC2 instances in Auto Scaling Groups, during business hours only (staff is watching and can react quickly). 1: J. Allspaw, “Fault injection in production,” Communications of the ACM, vol. 55, no. 10, pp. 48–52, 2012.
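The "business hours only" rule can be sketched as a simple guard in front of any injection. This is an illustration of the idea, not Netflix's implementation; the function name and the 09:00 to 17:00, Monday to Friday window are assumptions.

```python
from datetime import datetime


def may_inject(now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """Allow injection only Monday to Friday within [start_hour, end_hour)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < end_hour
```

A campaign runner would call `may_inject(datetime.now())` before each experiment and postpone the experiment otherwise.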
  • 13. introduction, related work: programmatic identification of the fault load: extracted from code (would testing be self-fulfilling?); extracted from formal description (laborious, complicated) ➜ little incorporation of users’ expectations. user := person assessing the dependability of a service, e.g., developers, architects. reference: D. Avresky, J. Arlat, J.-C. Laprie, and Y. Crouzet, “Fault injection for formal testing of fault tolerance,” IEEE Transactions on Reliability, vol. 45, no. 3, pp. 443–455, 1996.
  • 14. introduction, related work: tools for injecting faults tend to be highly specialized (e.g., programming language, execution environments, APIs) ➜ hard to adapt; they tend to employ a fine granularity, and with a lot of components in many layers of abstraction? ➜ inappropriate for complex systems (too many injection points / too complex)
  • 15. research question: How can we make software fault injection for complex distributed systems more accessible and easier to integrate into software development practices?
  • 16. approach: model-based (capture users’ expectations in a comprehensible format); full automation (campaign derivation and exercise; campaign := all fault injection experiments for a specific assessment); case study on OpenStack (recent, complex, distributed, promises high availability)
  • 17. approach [diagram: model, automation, workload, system under test, measurements]
  • 18. approach [diagram: model, campaign (derivation, exercise), workload, system under test, measurements]
  • 19. dependability modeling [diagram: model, campaign (derivation, exercise), workload, system under test, measurements]
  • 20. dependability modeling: models to capture users’ expectations: comprehensible by non-developers; commonly used to cope with complexity; usually formal; machine-readable; fosters representativeness because it incorporates users’ experience (e.g., structure of the system under test, granularity, fault types)
  • 21. dependability modeling, fault tree diagrams: for root cause analyses; top-down, deductive; used up-front or for accident investigation; Boolean logic; extensions for timing, dependencies, etc.
  • 24. fault tree diagrams: online editor available
  • 25. dependability modeling, granularity: Where to inject? system under test (e.g., source code); interfaces of system under test (e.g., APIs); interfaces built on (e.g., other services, host OS); hardware (e.g., network, data, performance, clock drift) ➜ need clear scope to cope with complexity [diagram: workload, system under test]
  • 26. dependability modeling, granularity: for this assessment, nodes are black boxes. node := physical or virtual machine the system is composed of, i.e., nodes are seen as “atoms” of the system ➜ nodes are targets for fault injection
  • 27. dependability modeling [diagram: model, campaign (derivation, exercise), workload, system under test, measurements]
  • 28. OpenStack [diagram: model, campaign (derivation, exercise), workload, system under test, measurements]
  • 29. case study: assessing the dependability of OpenStack
  • 30. OpenStack: framework for building IaaS platforms (providing, e.g., VMs, storage, networking to its users); free, open-source software (Apache License 2.0); emerged from projects at Rackspace and NASA, i.e., from Cloud Files and Nebula; composed of services; communication to/between them mostly RESTful APIs via HTTP
  • 31. OpenStack services [diagram: dashboard (Horizon), auth* (Keystone), virtual machine image (Glance), object (Swift), block (Cinder), network (Neutron), compute (Nova); edge labels: provides images, might use, provides block storage, provides network, manages, stores backups, stores images, provides UI, provides auth, VNC access]
  • 32. OpenStack Fuel: OpenStack distribution; UIs to manage OpenStack instances; capable of installing a high availability setup. high availability :≈ as commonly used (i.e., higher dependability than common comparable systems)
  • 33. OpenStack Fuel concept [diagram: Fuel master; configured OpenStack nodes; OpenStack node getting configured; unconfigured nodes; master sends OS images via PXE for bootstrapping]
  • 34. OpenStack Fuel: OpenStack distribution; UIs to manage OpenStack instances; capable of installing a high availability setup; distributes services according to node roles. high availability :≈ as commonly used (i.e., higher dependability than common comparable systems)
  • 35. OpenStack Fuel node roles [diagram: the OpenStack services Horizon, Keystone, Glance, Cinder, Neutron, Nova, and Swift with their relations, grouped by node role: storage, controller, compute]
  • 36. OpenStack Fuel nodes and networks [diagram: master; 3 controller, 3 storage, and 2 compute nodes; networks: mgmt, storage, public, private, PXE; internet]
  • 37. OpenStack Fuel deployment attempt 1: 1 × Intel Xeon E3-1284L v4, 4C/8T, 2.90GHz; 32GB, 4 × 8GB PC3L-12800 (DDR3–1600) SODIMM; 120GB HP 765479-B21 SSD (M.2 2280); Ubuntu 16.04, Linux 4.4.0. main insights: A. unusably slow; B. unstable; a. too little RAM
  • 38. OpenStack Fuel deployment attempt 2: 2 × Intel Xeon E5-2630, 6C/12T, 2.30GHz; 384GB, 24 × 16GB PC3-12800 (DDR3–1600) DIMM; 1TB RAID-1, 2 × 1TB 7,200 RPM, 64MB cache; Debian 8.6, Linux 4.7.0. main insights: A. very hard (but possible) to get sufficient performance (esp. IO) with nested virtualization; B. snapshot and restore of running OpenStack via outer VM does not work
  • 39. OpenStack Fuel deployment attempt 2: virtualization IO settings
  • 40. OpenStack Fuel deployment attempt 3: 2 × Intel Xeon E5-2630, 6C/12T, 2.30GHz; 384GB, 24 × 16GB PC3-12800 (DDR3–1600) DIMM; 1TB RAID-1, 2 × 1TB 7,200 RPM, 64MB cache; Debian 8.6, Linux 4.7.0. main insights: A. usable performance; B. snapshot and restore of running OpenStack via inner VMs does not work (a. VMs need shutdown and reboot). Unfortunately, the physical host became unavailable during the assessment.
  • 41. OpenStack Fuel deployment used for assessments, each host: 1 × Intel Xeon E3-1284L v4, 4C/8T, 2.90GHz; 32GB, 4 × 8GB PC3L-12800 (DDR3–1600) SODIMM; 120GB HP 765479-B21 SSD (M.2 2280); Mellanox ConnectX-3 Pro (dual 10GbE, one used); Ubuntu 16.04, Linux 4.4.0
  • 42. OpenStack Fuel deployment used for assessments: virtualized (easy programmatic machine crashes, freezes, etc.); virtualized hardware, i.e., full virtualization (enables programmatic injection in virtual hardware); virtualized networks (easy programmatic injection in network traffic); 9 nodes distributed over 5 physical hosts (after a lot of experimentation with hardware & virtualization)
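To illustrate why virtualization makes crashes and freezes easy to inject programmatically, here is a hedged sketch that maps fault types to `virsh` command lines. It assumes the nodes are libvirt-managed VMs, which the slides do not state; the function name and the domain name are hypothetical. The sketch only builds the command and does not execute anything.

```python
def injection_command(fault: str, domain: str) -> list[str]:
    """Map a fault type to a virsh command line (dry run: nothing is executed)."""
    commands = {
        "crash": ["virsh", "destroy", domain],   # forceful stop, like pulling the plug
        "freeze": ["virsh", "suspend", domain],  # pause the VM until it is resumed
    }
    if fault not in commands:
        raise ValueError(f"unknown fault type: {fault}")
    return commands[fault]


# Hypothetical domain name; passing the result to subprocess.run would inject the fault.
print(injection_command("crash", "controller-1"))
```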
  • 43. example workload [diagram: model, campaign (derivation, exercise), workload, system under test, measurements]
  • 44. example workload [diagram: model, campaign (derivation, exercise), workload, system under test, measurements]
  • 45. example workload, Tahoe-LAFS: Least-Authority File Store; distributed file store; free, open-source, peer-to-peer; uses three main IaaS components (storage, compute, network); automatable through APIs; performance counters for analysis
  • 46. example workload, Tahoe-LAFS grid [diagram: introducer; CLI, sftp, REST, …; nodes; trusted node]
  • 47. example workload storage stack
  • 48. example workload [diagram: model, campaign (derivation, exercise), workload, system under test, measurements]
  • 49. fault tree modeling [diagram: model, campaign (derivation, exercise), workload, system under test, measurements]
  • 53. OpenStack Fuel controller HA: HAProxy (load balancing & failover); MySQL cluster (active/active); RabbitMQ cluster (active/active); pacemaker (cluster manager, e.g., trigger MySQL rebuild after network partitioning); corosync (manages quorum, e.g., stop services if controller not in quorum). source: https://docs.mirantis.com/openstack/fuel/fuel-8.0/mos-planning-guide.html#fuel-reference-architecture-overview
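The quorum rule mentioned for corosync can be illustrated with a strict-majority check. This is a simplified sketch of the general idea, not corosync's actual quorum implementation or configuration; the function name is hypothetical.

```python
def has_quorum(reachable_controllers: int, total_controllers: int) -> bool:
    """Strict majority rule: more than half of all controllers must be reachable."""
    return reachable_controllers > total_controllers // 2


# With three controllers, a partition containing two keeps quorum,
# while an isolated single controller would stop its services.
print(has_quorum(2, 3), has_quorum(1, 3))
```

Under this rule a three-controller setup tolerates the loss of one controller, which is why crashing a single controller is a scenario that is expected to succeed.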
  • 57. fault tree modeling: alternative input: fault tree in dot format
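As an illustration of what a fault tree in DOT format might look like, the sketch below renders a small tree as Graphviz DOT text. The tree, the event names, and the way gate types are encoded in node labels are assumptions; the conventions expected by the presented tooling are not shown in the transcript.

```python
def fault_tree_to_dot(gates: dict[str, tuple[str, list[str]]]) -> str:
    """Render gates (name -> (gate type, child events)) as a Graphviz digraph."""
    lines = ["digraph fault_tree {"]
    for gate, (kind, children) in gates.items():
        lines.append(f'  "{gate}" [label="{gate} ({kind})", shape=box];')
        for child in children:
            lines.append(f'  "{gate}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical fault tree: the service fails if either sub-event occurs.
tree = {
    "service unavailable": ("OR", ["controllers fail", "compute nodes fail"]),
    "controllers fail": ("AND", ["crash controller-1", "crash controller-2"]),
    "compute nodes fail": ("AND", ["crash compute-1", "crash compute-2"]),
}
dot = fault_tree_to_dot(tree)
print(dot)
```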
  • 58. fault tree modeling [diagram: model, campaign (derivation, exercise), workload, system under test, measurements]
  • 59. campaign derivation [diagram: model, campaign (derivation, exercise), workload, system under test, measurements]
  • 60. campaign derivation: derive campaign from dependability model: representative (i.e., injected faults are realistic & relevant); full coverage (i.e., test all modeled fault tolerance mechanisms); repeatable (i.e., ability to repeat exercising a campaign); efficient (i.e., a reasonable cost-benefit ratio)
  • 61. campaign derivation, dependability stress: F: all fault injection points ≙ all basic events. s ⊆ F: scenario ≙ a set of faults to inject. A := ℘(F): all possible scenarios ≙ all possible subsets of F. C ⊆ A: campaign ≙ sets of faults to inject. s ∊ C ⇔ ?
  • 62. campaign derivation, dependability stress: exercising all scenarios is not feasible, |℘(F)| = 2^|F| ➜ exclude certain scenarios: (1) scenarios which are expected to fail, re-using a well-known algorithm for fault trees (MOCUS); (2) scenarios which are included in others, e.g., exclude {crash A} if there is {crash A, crash B}, i.e., “maximization” of scenarios
  • 63. campaign derivation, dependability stress [figure labels: all |℘(F)| = 2^|F| scenarios; expect failure scenarios; expect success scenarios; non-maximal scenarios; campaign]
  • 64. campaign derivation, dependability stress: maximize “dependability stress”, i.e., inject as many faults simultaneously as tolerable; tests for synergistic failures, i.e., tests for interferences between fault tolerance mechanisms, since fault tolerance mechanisms are active simultaneously; increases efficiency: scenarios are “merged”; repeatable: derivation is deterministic
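The exclusion and maximization steps described on the slides above can be sketched as follows. This is a minimal illustration, not the presented framework's implementation: it assumes the minimal cut sets of the fault tree are already known (on the slides they come from MOCUS), and it enumerates the power set by brute force, which is only practical for small F. The names `powerset`, `derive_campaign`, and the example faults are hypothetical.

```python
from itertools import combinations


def powerset(faults):
    """Yield every subset of `faults` as a frozenset (2^|F| subsets)."""
    faults = sorted(faults)
    for size in range(len(faults) + 1):
        for combo in combinations(faults, size):
            yield frozenset(combo)


def derive_campaign(faults, minimal_cut_sets):
    """Return the maximal scenarios that are expected to be tolerated.

    Assumption: `minimal_cut_sets` lists the minimal fault combinations
    that make the modeled system fail (e.g., as computed by MOCUS).
    """
    cut_sets = [frozenset(c) for c in minimal_cut_sets]
    # Step 1: drop scenarios expected to fail (they contain a cut set).
    tolerable = [s for s in powerset(faults)
                 if not any(c <= s for c in cut_sets)]
    # Step 2: drop scenarios that are proper subsets of another scenario.
    maximal = [s for s in tolerable
               if not any(s < other for other in tolerable)]
    # Sorting keeps the derivation deterministic.
    return sorted(maximal, key=sorted)


# Hypothetical model: two redundant pairs, (A, B) and (C, D).
faults = {"crash A", "crash B", "crash C", "crash D"}
cut_sets = [{"crash A", "crash B"}, {"crash C", "crash D"}]
campaign = derive_campaign(faults, cut_sets)
for scenario in campaign:
    print(sorted(scenario))
```

In this hypothetical model, 4 of the 16 possible scenarios remain, each injecting one fault per redundant pair at the same time.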
  • 65. case study, campaign derivation: all scenarios: 2048, would take ~333h to exercise; maximal scenarios: 36, takes ~6h to exercise; …with 3 fault types & 10 runs per scenario: all scenarios: 61440, ~416d to exercise; maximal scenarios: 1080, ~7.3d to exercise
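The run-time estimates on this slide can be cross-checked with a few lines of arithmetic. The sketch below only uses the numbers given on the slide and assumes that every scenario run takes the same time on average; since 2048 = 2^11, the count of all scenarios is consistent with 11 fault injection points, which is an inference and not stated on the slide.

```python
ALL_SCENARIOS = 2048        # from the slide; equals 2**11
MAXIMAL_SCENARIOS = 36      # from the slide
HOURS_FOR_ALL = 333         # from the slide (approximate)
FAULT_TYPES = 3
RUNS_PER_SCENARIO = 10

# Assumption: uniform average duration per scenario run.
hours_per_run = HOURS_FOR_ALL / ALL_SCENARIOS

hours_maximal = MAXIMAL_SCENARIOS * hours_per_run

runs_all = ALL_SCENARIOS * FAULT_TYPES * RUNS_PER_SCENARIO
runs_maximal = MAXIMAL_SCENARIOS * FAULT_TYPES * RUNS_PER_SCENARIO

days_all = runs_all * hours_per_run / 24
days_maximal = runs_maximal * hours_per_run / 24

print(f"one run: ~{hours_per_run * 60:.1f} min")
print(f"maximal scenarios: ~{hours_maximal:.1f} h")
print(f"all runs: {runs_all}, ~{days_all:.0f} d")
print(f"maximal runs: {runs_maximal}, ~{days_maximal:.1f} d")
```

Under these assumptions the figures agree with the slide: roughly 6h, 416d, and 7.3d.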
  • 66. campaign derivation [overview diagram: system under test, workload, model, campaign derivation, campaign exercise, measurements]
  • 67. exercise of the campaign [overview diagram: system under test, workload, model, campaign derivation, campaign exercise, measurements]
  • 68. exercise of the campaign: via user-provided implementation: executables in a specific directory structure; get state as CLI parameters; any language of preference; reuse existing tools; low level of integration; easily exchangeable; well-known pattern from UNIX-like OS, e.g., “hooks”, run-parts utility
  • 69. exercise of the campaign, directories for executables: 1 ./pre-campaign, e.g., take snapshot of system, prepare logging; 2-1 ./pre-scenario, e.g., restore snapshot, measure performance; 2-2 ./event, invoked per fault to inject, e.g., power off a VM; 2-3 ./post-scenario, e.g., measure performance; 3 ./post-campaign, e.g., restore snapshot, trigger data analysis
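The directory-based hook pattern listed above could be driven by a small runner like the one below. This is a sketch under assumptions, not the presented framework's code: the directory names are taken from the slide, but the functions `run_hooks` and `exercise_campaign`, the lexical execution order (borrowed from run-parts), and the CLI parameters passed to the executables (scenario number and fault name) are hypothetical.

```python
import os
import subprocess
from pathlib import Path


def run_hooks(directory, args=()):
    """Run all executables in `directory` in lexical order.

    Returns a list of (file name, exit code). A missing directory
    simply means there is nothing to run.
    """
    directory = Path(directory)
    if not directory.is_dir():
        return []
    results = []
    for path in sorted(directory.iterdir()):
        if path.is_file() and os.access(path, os.X_OK):
            # State is handed over as CLI parameters (hypothetical layout).
            done = subprocess.run([str(path), *args], check=False)
            results.append((path.name, done.returncode))
    return results


def exercise_campaign(base_dir, campaign):
    """Exercise each scenario (a set of fault names) of a campaign."""
    base = Path(base_dir)
    run_hooks(base / "pre-campaign")            # 1
    for number, scenario in enumerate(campaign):
        run_hooks(base / "pre-scenario", [str(number)])          # 2-1
        for fault in sorted(scenario):
            run_hooks(base / "event", [str(number), fault])      # 2-2
        run_hooks(base / "post-scenario", [str(number)])         # 2-3
    run_hooks(base / "post-campaign")           # 3
```

Because the runner only starts executables, the hooks themselves can be written in any language and can wrap existing tools, which matches the low level of integration named on the previous slide.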
  • 70. exercise of the campaign: workers
  • 71. exercise of the campaign [overview diagram: system under test, workload, model, campaign derivation, campaign exercise, measurements]
  • 72. results [overview diagram: system under test, workload, model, campaign derivation, campaign exercise, measurements]
  • 73. results, measurements: perspective of OpenStack users; one specific use case: upload 100 MB of random binary data to Tahoe-LAFS grid before, during, and after faults are active; collect performance counters from Tahoe-LAFS; analysis based on performance degradation: “how many times it took longer in the presence of faults”
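The degradation measure quoted above, “how many times it took longer in the presence of faults”, amounts to a ratio of durations. The sketch below assumes that repeated runs are aggregated with the median, which the slides do not specify; the function name and all timing values are made up for illustration and are not results from the case study.

```python
from statistics import median


def degradation_factor(baseline_seconds, faulty_seconds):
    """How many times longer the workload took with faults active.

    Assumption: repeated runs are summarized by their median.
    """
    return median(faulty_seconds) / median(baseline_seconds)


# Hypothetical upload durations in seconds (not measured data).
baseline = [10.0, 12.0, 11.0]
with_faults = [22.0, 20.0, 24.0]

factor = degradation_factor(baseline, with_faults)
print(f"degradation: {factor:.1f}x")  # prints "degradation: 2.0x"
```

A factor of 1.0 would mean no measurable degradation relative to the baseline.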
  • 74. results
  • 75. results: degradation after activating stop faults [figure, with baseline]
  • 76. results: degradation after deactivating stop faults [figure, with baseline]
  • 77. results [overview diagram: system under test, workload, model, campaign derivation, campaign exercise, measurements]
  • 78. conclusion [overview diagram: system under test, workload, model, campaign derivation, campaign exercise, measurements]
  • 79. conclusion, challenges: campaign derivation: with non-Boolean elements in the dependability model; complex distributed systems: getting a fully virtualized OpenStack setup with performant nested virtualization; restoring from snapshots: after restore, boot all nodes & wait for OpenStack to be operable; automation: absent APIs, long run times (hard to debug), sporadic failures
  • 80. conclusion, contributions: [previous slide] + framework for software fault injection: model-based, automated, repeatable, flexible; dependability assessment of OpenStack Fuel: reusable implementation for exercising the campaign; baseline performance fluctuation; two fault types: crash, stop
  • 81. ? ! Lukas Pirl | slideshare+sfi-os@lukas-pirl.de | http://lukas-pirl.de | supervision: Lena Feinbube, Prof. Dr. Andreas Polze | Operating Systems and Middleware Group, Hasso Plattner Institute for Digital Engineering, University of Potsdam | June 2017