identification and exercise of fault injection campaigns for experimental dependability assessment of distributed systems exemplified by a case study on OpenStack

identification and exercise of
fault injection campaigns for
experimental dependability assessments
of distributed systems
exemplified by a case study on OpenStack
Lukas Pirl
supervision: Lena Feinbube, Prof. Dr. Andreas Polze
Operating Systems and Middleware Group
Hasso Plattner Institute for Digital Engineering
University of Potsdam
June 2017

about this presentation
high-level to deliver concepts
few technical details about non-contributions
e.g., no listings of software used
details upon request, of course
Lukas Pirl | Identification & Exercise of Fault Injection Campaigns for Distributed Systems / Case Study OpenStack | June 2017 | Hasso Plattner Institute Potsdam

introduction
motivation
software systems are increasingly …
… critical
failures can cause economic, ecologic, or even personal damage
… complex
higher probability of, e.g., design flaws, bugs, interferences
… distributed
notoriously hard

introduction
dependability
dependability is a measure of quality
allows comparisons
and thus optimization
dependability := the ability to deliver service that can justifiably be trusted ¹
note: this does not imply, e.g., 100% availability: “service” depends on specification
1: A. Avižienis, J.-C. Laprie, and B. Randell, “Dependability and Its Threats: A Taxonomy,” in Building the Information Society, Springer, Boston, MA, 2004, pp. 91–120.

introduction
dependability
concerns whole society
individuals
e.g., banking, communication
enterprises
e.g., resource planning, plant control
politics
e.g., electronic voting, taxing

introduction
dependability
concerns all deployment models
single software systems
e.g., in medical applications, such as pacemakers
distributed systems
e.g., in traffic, such as train protection
Cloud (i.e., IaaS)
e.g., in research, such as genome analysis
services (i.e., SaaS)
e.g., in information management, such as data storage

introduction
war stories
fault tolerant distributed systems do fail¹
2.5h Facebook outage 2010
“friendly” DDoS due to wrong configuration value
8h Azure outage 2012
leap day bug in SSL certificate generation
4.5h Amazon S3 outage 2017
typo in manual command took “too many” servers down
1: D. Oppenheimer et al., “Why do Internet services fail, and what can be done about it?,” in USENIX Symposium on Internet Technologies and Systems, 2003, vol. 67.

introduction
fault injection
fault injection
⊂ testing
experimental dependability assessment
with relatively low complexity
concept:
1. forcefully introduce faults
fault := suspected error cause
2. assess the delivered quality of service
long-established for testing hardware

introduction
software fault injection
software fault injection
implemented in software and targeting software
!= HWIFI: hardware-implemented & targeting hardware
!= SWIFI: software-implemented & targeting hardware
not yet widely adopted for testing software
missing accessibility?
tools too specialized?
available information too scattered/heterogeneous/theoretical?
missing automation?
e.g., in comparison to unit testing

introduction
fundamental questions for assessments
very specific to the system under test
What to inject? ➜ fault model
fault classes, e.g., crash faults
When to inject? ➜ trigger mechanism
when injecting during runtime, likely chosen according to workload
Where to inject? ➜ dependability model
at spots where faults should be tolerated

introduction
software fault injection – success stories
fault injection in production
Etsy ¹
Netflix
Chaos Monkey
terminates Amazon EC2 instances
in Auto Scaling Groups
during business hours only
staff is watching and can react quickly
1: J. Allspaw, “Fault injection in production,” Communications of the ACM, vol. 55, no. 10, pp. 48–52, 2012.

introduction
related work
programmatic identification of the fault load
extracted from code
would testing be self-fulfilling?
extracted from formal description
laborious, complicated
➜ little incorporation of users’ expectations
user := person assessing the dependability of a service
e.g., developers, architects
D. Avresky, J. Arlat, J.-C. Laprie, and Y. Crouzet, “Fault injection for formal testing of fault tolerance,” IEEE Transactions on Reliability, vol. 45, no. 3, pp. 443–455, 1996.
introduction
related work
tools for injecting faults
tend to be highly specialized
e.g., programming language, execution environments, APIs
➜ hard to adapt
tend to employ a fine granularity
what about systems with many components in many layers of abstraction?
➜ inappropriate for complex systems
too many injection points / too complex

research question
How can we make software fault injection
for complex distributed systems
more easily accessible and integratable
in software development practices?

approach
model-based
capture users’ expectations in a comprehensible format
full automation
campaign derivation and exercise
campaign := all fault injection experiments for specific assessment
case study
on OpenStack
recent, complex, distributed, promises high availability

approach
[diagram: model, automation, workload, system under test, measurements]

approach
[diagram: model, derivation, campaign, exercise, workload, system under test, measurements]

dependability modeling
[overview diagram: model, derivation, campaign, exercise, workload, system under test, measurements]
models
to capture users’ expectations
comprehensible by non-developers
commonly used to cope with complexity
usually formal
machine-readable
fosters representativeness
because incorporates users’ experience
e.g., structure of the system under test, granularity, fault types

fault tree diagrams
for root cause analyses
top-down, deductive
used up-front or for accident investigation
Boolean logic
extensions for timing, dependencies, etc.
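
The Boolean logic of a fault tree can be sketched in a few lines. The following is a minimal illustration, not the tooling used in this work; the class names, the gate set (only AND and OR), and the example events are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union


@dataclass(frozen=True)
class BasicEvent:
    """Leaf of the fault tree: a fault that may be active."""
    name: str


@dataclass(frozen=True)
class Gate:
    """Inner node that combines its children with Boolean logic."""
    kind: str  # "AND" or "OR" (assumed gate set for this sketch)
    children: Tuple["Node", ...]


Node = Union[BasicEvent, Gate]


def occurs(node: Node, active_faults: FrozenSet[str]) -> bool:
    """Return True if the event at `node` occurs given the active faults."""
    if isinstance(node, BasicEvent):
        return node.name in active_faults
    results = [occurs(child, active_faults) for child in node.children]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical tree: the service fails if both controllers crash
# or if the single storage node crashes.
top_event = Gate("OR", (
    Gate("AND", (BasicEvent("crash controller-1"),
                 BasicEvent("crash controller-2"))),
    BasicEvent("crash storage-1"),
))

print(occurs(top_event, frozenset({"crash controller-1"})))
print(occurs(top_event, frozenset({"crash storage-1"})))
```

Evaluating the top event for a given set of active faults answers whether the model expects the system to fail in that situation.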

fault tree diagrams
online editor available

granularity
Where to inject?
system under test
e.g., source code
interfaces of system under test
e.g., APIs
interfaces built on
e.g., other services, host OS
hardware
e.g., network, data, performance, clock drift
➜ need clear scope to cope with complexity

granularity
for this assessment: nodes are black boxes
node := physical or virtual machines the system is composed of
i.e., nodes seen as “atoms” of system
➜ nodes are targets for fault injection

[overview diagram: model, derivation, campaign, exercise, workload, measurements, system under test (OpenStack)]

case study
assessing the dependability of OpenStack

OpenStack
framework for building IaaS platforms
providing, e.g., VMs, storage, networking to its users
free, open-source software
Apache License 2.0
emerged from projects at Rackspace and NASA
i.e., from Cloud Files and Nebula
composed of services
communication to/between them mostly via RESTful HTTP APIs

OpenStack
services
dashboard: Horizon
auth*: Keystone
virtual machine image: Glance
object: Swift
block: Cinder
network: Neutron
compute: Nova
[diagram of relations between the services; edge labels: provides images, might use, provides block storage, provides network, manages, stores backups, stores images, provides UI, provides auth, VNC access]
OpenStack Fuel
OpenStack distribution
UIs to manage OpenStack instances
capable of installing a high availability setup
high availability :≈ as commonly used
(i.e., higher dependability than common comparable systems)

OpenStack Fuel
concept
[diagram: Fuel master; configured OpenStack nodes; OpenStack node getting configured; unconfigured nodes]
master sends OS images via PXE for bootstrapping

OpenStack Fuel
distributes services according to node roles

OpenStack Fuel
node roles
[diagram: the services Horizon, Keystone, Glance, Cinder, Neutron, Nova, and Swift with their relations, assigned to the node roles storage, controller, and compute]

OpenStack Fuel
nodes and networks
nodes: 3 × controller, 3 × storage, 2 × compute, 1 × master
networks: mgmt, storage, public, private, PXE, internet

OpenStack Fuel
deployment attempt 1
1 × Intel Xeon E3-1284L v4, 4C/8T, 2.90GHz
32GB, 4 × 8GB PC3L-12800 (DDR3–1600) SODIMM
120GB HP 765479-B21 SSD (M.2 2280)
Ubuntu 16.04, Linux 4.4.0
main insights:
A. unusably slow
B. unstable
a. too little RAM

OpenStack Fuel
2 × Intel Xeon E5-2630, 6C/12T, 2.30GHz
384GB, 24 × 16GB PC3-12800 (DDR3–1600) DIMM
1TB RAID-1, 2 × 1TB 7200 RPM, 64MB Cache
Debian 8.6, Linux 4.7.0
main insights:
A. very hard (but possible) to get sufficient performance (esp. IO) with nested virtualization
B. snapshot and restore of running OpenStack via outer VM does not work

OpenStack Fuel
deployment attempt 2: virtualization IO settings

OpenStack Fuel
2 × Intel Xeon E5-2630, 6C/12T, 2.30GHz
384GB, 24 × 16GB PC3-12800 (DDR3–1600) DIMM
1TB RAID-1, 2 × 1TB 7200 RPM, 64MB Cache
Debian 8.6, Linux 4.7.0
main insights:
A. usable performance
B. snapshot and restore of running OpenStack via inner VMs does not work
a. VMs need shutdown and reboot
Unfortunately, the physical host became unavailable during the assessment.

OpenStack Fuel
deployment used for assessments
each physical host:
1 × Intel Xeon E3-1284L v4, 4C/8T, 2.90GHz
32GB, 4 × 8GB PC3L-12800 (DDR3–1600) SODIMM
120GB HP 765479-B21 SSD (M.2 2280)
Mellanox Connect-X3 Pro (Dual 10GbE, one used)
Ubuntu 16.04, Linux 4.4.0

OpenStack Fuel
deployment used for assessments
virtualized
easy programmatic machine crashes, freezes, etc.
virtualized hardware
i.e., full virtualization
enables programmatic injection in virtual hardware
virtualized networks
easy programmatic injection in network traffic
9 nodes distributed over 5 physical hosts
after a lot of experimentation with hardware & virtualization

example workload
[overview diagram: model, derivation, campaign, exercise, system under test, workload, measurements]

example workload
Tahoe-LAFS
Least-Authority File Store
distributed file store
free, open-source, peer-to-peer
uses three main IaaS components
storage, compute, network
automatable through APIs
performance counters for analysis

example workload
Tahoe-LAFS grid
introducer
CLI
sftp
REST
…
node
node
node
trusted
node

example workload
storage stack

fault tree modeling
[overview diagram: model, derivation, campaign, exercise, system under test, workload, measurements]

OpenStack Fuel
controller HA
https://docs.mirantis.com/openstack/fuel/fuel-8.0/mos-planning-guide.html#fuel-reference-architecture-overview
HAProxy
load balancing & failover
MySQL cluster
active/active
RabbitMQ cluster
active/active
Pacemaker
cluster manager
e.g., trigger MySQL rebuild after network partitioning
Corosync
manages quorum
e.g., stop services if controller not in quorum
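
The quorum rule behind "stop services if controller not in quorum" can be illustrated with a simple majority check. This is a sketch under the assumption of one vote per controller and a strict-majority rule; the function name is hypothetical and the code does not call Corosync itself.

```python
def in_quorum(reachable_members: int, cluster_size: int) -> bool:
    """Strict majority: more than half of all members must be reachable.

    `reachable_members` counts the local node itself (assumption).
    """
    return reachable_members > cluster_size // 2


# With three controllers, one failure is tolerated, two are not.
for reachable in (3, 2, 1):
    print(reachable, in_quorum(reachable, cluster_size=3))
```

Under this rule, a setup with three controllers is expected to tolerate the loss of any single controller, which is the kind of expectation a dependability model captures.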

fault tree modeling
alternative input: fault tree in DOT format

campaign derivation
[overview diagram: model, derivation, campaign, exercise, system under test, workload, measurements]

campaign derivation
derive campaign from dependability model
representative
i.e., injected faults are realistic & relevant
full coverage
i.e., test all modeled fault tolerance mechanisms
repeatable
i.e., ability to repeat exercising a campaign
efficient
i.e., a reasonable cost-benefit ratio

campaign derivation
dependability stress
F: all fault injection points
≙ all basic events
s ⊆ F: scenario
≙ a set of faults to inject
A := ℘(F): all possible scenarios
≙ all possible subsets of F
C ⊆ A: campaign
≙ a set of scenarios, i.e., sets of faults to inject
s ∊ C ⇔ ?

campaign derivation
exercising all scenarios is not feasible
|℘(F)| = 2^{|F|}
➜ exclude certain scenarios
scenarios which are expected to fail
re-using well-known algorithm for fault trees (MOCUS)
scenarios which are included in others
e.g., exclude {crash A} if there is {crash A, crash B}
i.e., “maximization” of scenarios
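
The two exclusion steps above can be sketched as follows. Note that this sketch enumerates the power set by brute force instead of using MOCUS, so it only suits small fault sets; the predicate `expected_to_fail` stands in for an evaluation of the fault tree, and the fault names and the failure rule are hypothetical.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List

Scenario = FrozenSet[str]


def power_set(faults: Iterable[str]) -> List[Scenario]:
    """All subsets of the fault injection points (2^|F| scenarios)."""
    items = sorted(faults)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def derive_campaign(faults: Iterable[str],
                    expected_to_fail: Callable[[Scenario], bool]) -> List[Scenario]:
    """Keep tolerable scenarios that are not contained in another tolerable one."""
    tolerable = [s for s in power_set(faults) if not expected_to_fail(s)]
    maximal = [s for s in tolerable
               if not any(s < other for other in tolerable)]
    return sorted(maximal, key=sorted)  # deterministic order


# Hypothetical model: failure if two or more controllers crash
# or if both storage nodes crash.
def expected_to_fail(scenario: Scenario) -> bool:
    controllers = sum(1 for f in scenario if f.startswith("crash controller"))
    storages = sum(1 for f in scenario if f.startswith("crash storage"))
    return controllers >= 2 or storages >= 2


faults = ["crash controller-1", "crash controller-2", "crash controller-3",
          "crash storage-1", "crash storage-2"]
campaign = derive_campaign(faults, expected_to_fail)
print(len(power_set(faults)), len(campaign))  # 32 6
```

In this example, each remaining scenario crashes one controller and one storage node at the same time, so smaller scenarios such as a single controller crash are covered implicitly.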

campaign derivation
[diagram: all 2^{|F|} scenarios, divided into expect-failure scenarios and expect-success scenarios; the expect-success scenarios are divided into non-maximal scenarios and the campaign]

campaign derivation
maximize “dependability stress”
i.e., inject as many faults simultaneously as tolerable
tests for synergistic failures
i.e., tests for interferences between fault tolerance mechanisms
since fault tolerance mechanisms are active simultaneously
increases efficiency
scenarios are “merged”
repeatable
derivation is deterministic

case study
campaign derivation
all scenarios: 2048
would take ~333h to exercise
maximal scenarios: 36
takes ~6h to exercise
…with 3 fault types & 10 runs per scenario:
all scenarios: 61440
would take ~416d to exercise
maximal scenarios: 1080
takes ~7.3d to exercise
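
The durations above can be reproduced with a short calculation. The sketch assumes that every scenario takes the same time to exercise, derived from the ~333h stated for all 2048 scenarios; the variable and function names are chosen for this illustration.

```python
ALL_SCENARIOS = 2048        # from the slide, equals 2**11
MAXIMAL_SCENARIOS = 36      # from the slide
HOURS_FOR_ALL = 333         # approximate value from the slide
FAULT_TYPES = 3
RUNS_PER_SCENARIO = 10

# Assumption: constant duration per scenario (roughly 10 minutes).
hours_per_scenario = HOURS_FOR_ALL / ALL_SCENARIOS


def exercise_hours(scenarios: int, fault_types: int = 1, runs: int = 1) -> float:
    """Estimated time to exercise the given number of scenarios."""
    return scenarios * fault_types * runs * hours_per_scenario


print(f"per scenario: {hours_per_scenario * 60:.1f} min")
print(f"maximal scenarios: {exercise_hours(MAXIMAL_SCENARIOS):.1f} h")
print(f"all, 3 fault types, 10 runs: "
      f"{exercise_hours(ALL_SCENARIOS, FAULT_TYPES, RUNS_PER_SCENARIO) / 24:.0f} d")
print(f"maximal, 3 fault types, 10 runs: "
      f"{exercise_hours(MAXIMAL_SCENARIOS, FAULT_TYPES, RUNS_PER_SCENARIO) / 24:.1f} d")
```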

exercise of the campaign
[overview diagram: model, derivation, campaign, exercise, system under test, workload, measurements]

via user-provided implementation
executables in a specific directory structure
get state as CLI parameters
any language of preference
reuse existing tools
low level of integration
easily exchangeable
well-known pattern from UNIX-like OS
e.g., “hooks”, run-parts utility

directories for executables
1 ./pre-campaign
e.g., take snapshot of system, prepare logging
2-1 ./pre-scenario
e.g., restore snapshot, measure performance
2-2 ./event
invoked per fault to inject, e.g., power off a VM
2-3 ./post-scenario
e.g., measure performance
3 ./post-campaign
e.g., restore snapshot, trigger data analysis
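
The order in which the directories are visited could be driven by a small runner like the one below. This is a sketch, not the implementation from this work: the directory names are taken from the list above, whereas the function names, the lexical execution order, and the exact CLI parameters passed to the executables are assumptions.

```python
import os
import subprocess
from pathlib import Path
from typing import Iterable, Sequence


def run_hooks(directory: Path, args: Sequence[str] = ()) -> None:
    """Run every executable file in `directory` in lexical order (like run-parts)."""
    if not directory.is_dir():
        return
    for hook in sorted(directory.iterdir()):
        if hook.is_file() and os.access(hook, os.X_OK):
            # Abort the campaign if a hook exits with a non-zero status.
            subprocess.run([str(hook), *args], check=True)


def exercise(root: Path, campaign: Iterable[Sequence[str]]) -> None:
    """Exercise all scenarios; each scenario is a sequence of fault names."""
    run_hooks(root / "pre-campaign")
    for scenario in campaign:
        run_hooks(root / "pre-scenario", scenario)
        for fault in scenario:
            run_hooks(root / "event", [fault])  # invoked once per fault
        run_hooks(root / "post-scenario", scenario)
    run_hooks(root / "post-campaign")
```

A call such as `exercise(Path("./hooks"), [["crash controller-1", "crash storage-1"]])` would then run the hooks for a single scenario with two faults; the path and fault names are placeholders.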

exercise of the campaign
workers

results
[overview diagram: model, derivation, campaign, exercise, system under test, workload, measurements]

results
measurements
perspective of OpenStack users
one specific use case
upload 100MB of random binary data to Tahoe-LAFS grid
before, during, after faults are active
collect performance counters from Tahoe-LAFS
analysis based on performance degradation
“how many times it took longer in the presence of faults”
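
The degradation factor can be computed as a simple ratio. The sketch below measures wall-clock time around an upload callable and compares against the median of baseline runs; both choices are assumptions for illustration, since the assessment itself relies on performance counters collected from Tahoe-LAFS. The `upload` callable is a placeholder for the actual upload to the grid.

```python
import os
import time
from statistics import median
from typing import Callable, Sequence


def timed_upload(upload: Callable[[bytes], None], size_bytes: int) -> float:
    """Upload `size_bytes` of random data and return the elapsed seconds."""
    payload = os.urandom(size_bytes)
    start = time.perf_counter()
    upload(payload)
    return time.perf_counter() - start


def degradation(baseline_seconds: Sequence[float], faulty_seconds: float) -> float:
    """How many times longer the upload took compared to the baseline median."""
    return faulty_seconds / median(baseline_seconds)


# Hypothetical measurements in seconds.
print(degradation([10.0, 12.0, 11.0], 33.0))  # 3.0
```

A factor close to 1 means the faults had no visible effect on the upload, while larger factors indicate degraded service.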

results

results
degradation after activating stop faults
baseline

results
degradation after deactivating stop faults
baseline

conclusion
[overview diagram: model, derivation, campaign, exercise, system under test, workload, measurements]

conclusion
challenges
campaign derivation
with non-Boolean elements in the dependability model
complex distributed systems
getting a fully virtualized OpenStack setup
with performant nested virtualization
restoring from snapshots
after restore: boot all nodes & wait for OpenStack to be operable
automation
absent APIs, long run times (hard to debug), sporadic failures

conclusion
contributions
[previous slide] +
framework for software fault injection
model-based, automated, repeatable, flexible
dependability assessment of OpenStack Fuel
reusable implementation for exercising the campaign
baseline performance fluctuation
two fault types
crash, stop

? !
Lukas Pirl | slideshare+sfi-os@lukas-pirl.de | http://lukas-pirl.de
