identification and exercise of fault injection campaigns for experimental dependability assessment of distributed systems, exemplified by a case study on OpenStack
As distributed systems, for example in cloud computing, grow in importance for today’s society, so does the need to ensure their dependability. For complex, fast-evolving distributed systems, software fault injection as a means of experimental dependability assessment has not yet been used to its full potential.
This work presents a structured method to derive a software fault injection campaign from a user-provided dependability model. The campaign is derived deterministically and aims to test for synergistic effects: it identifies all combinations of as many concurrently tolerable faults as possible, while also optimizing for efficiency. An implemented tool derives the campaign programmatically and coordinates its exercise via user-provided executables, which in turn implement the specifics of the actual dependability assessment.
For the case study on OpenStack, its fault tolerance mechanisms are described and consolidated in a dependability model, represented as a fault tree. Practical evaluations of a fully virtualized, high availability OpenStack show that its setup, as well as its checkpointing and rollback, are challenging. The efforts finally yielded a setup that allowed the exercise of the campaign to be fully automated. On top of OpenStack, the application Tahoe-LAFS (Least-Authority File Store) serves as the workload and is used to measure the performance degradation while exercising the campaign.
The high availability OpenStack setup does not reveal any outstandingly critical components. In the presence of the injected faults, Tahoe-LAFS experiences performance degradations between a factor of ~1 and ~4, with an average of ~1.8.
1. identification and exercise of
fault injection campaigns for
experimental dependability assessments
of distributed systems
exemplified by a case study on OpenStack
Lukas Pirl
supervision: Lena Feinbube, Prof. Dr. Andreas Polze
Operating Systems and Middleware Group
Hasso Plattner Institute for Digital Engineering
University of Potsdam
June 2017
2. about this presentation
high-level to deliver concepts
few technical details about non-contributions
e.g., no listings of software used
details upon request, of course
Lukas Pirl | Identification & Exercise of Fault Injection Campaigns for Distributed Systems / Case Study OpenStack | June 2017 | Hasso Plattner Institute Potsdam | 2
3. introduction
motivation
software systems are increasingly …
… critical
failures can cause economic, ecologic, or even personal damage
… complex
higher probability of, e.g., design flaws, bugs, interferences
… distributed
notoriously hard
4. introduction
dependability
dependability is a measure of quality
allows comparisons
and thus optimization
dependability := the ability to deliver
service that can justifiably be trusted ¹
note: this does not imply, e.g., 100% availability: “service” depends on specification
1: A. Avižienis, J.-C. Laprie, and B. Randell, “Dependability and Its Threats: A Taxonomy,” in Building the Information Society, Springer, Boston, MA, 2004, pp. 91–120.
5. introduction
dependability
concerns whole society
individuals
e.g., banking, communication
enterprises
e.g., resource planning, plant control
politics
e.g., electronic voting, taxing
6. introduction
dependability
concerns all deployment models
single software systems
e.g., in medical applications, such as pacemakers
distributed systems
e.g., in traffic, such as train protection
Cloud (i.e., IaaS)
e.g., in research, such as genome analysis
services (i.e., SaaS)
e.g., in information management, such as data storage
7. introduction
war stories
fault tolerant distributed systems do fail ¹
2.5h Facebook outage 2010
“friendly” DDoS due to wrong configuration value
8h Azure outage 2012
leap day bug in SSL certificate generation
4.5h Amazon S3 outage 2017
typo in manual command took “too many” servers down
see this link for visualizations and more examples
1: D. Oppenheimer et al., “Why do Internet services fail, and what can be done about it?,” in USENIX Symposium on Internet Technologies and Systems, 2003, vol. 67.
8. introduction
fault injection
fault injection
⊂ testing
experimental dependability assessment
with relatively low complexity
concept:
1. forcefully introduce faults
fault := suspected error cause
2. assess the delivered quality of service
long-established for testing hardware
10. introduction
software fault injection
software fault injection
implemented in software and targeting software
!= HWIFI: hardware-implemented & targeting hardware
!= SWIFI: software-implemented & targeting hardware
not yet widely adopted for testing software
missing accessibility?
tools too specialized?
available information too scattered/heterogeneous/theoretical?
missing automation?
e.g., in comparison to unit testing
11. introduction
fundamental questions for assessments
very specific to the system under test
What to inject? ➜ fault model
fault classes, e.g., crash faults
When to inject? ➜ trigger mechanism
when injecting during runtime, likely chosen according to workload
Where to inject? ➜ dependability model
at spots where faults should be tolerated
12. introduction
software fault injection – success stories
fault injection in production
Etsy ¹
Netflix
Chaos Monkey
terminates Amazon EC2 instances
in Auto Scaling Groups
during business hours only
staff is watching and can react quickly
1: J. Allspaw, “Fault injection in production,” Communications of the ACM, vol. 55, no. 10, pp. 48–52, 2012.
13. introduction
related work
programmatic identification of the fault load
extracted from code
would testing be self-fulfilling?
extracted from formal description
laborious, complicated
➜ little incorporation of users’ expectations
user := person assessing the dependability of a service
e.g., developers, architects
D. Avresky, J. Arlat, J.-C. Laprie, and Y. Crouzet, “Fault injection for formal testing of fault tolerance,” IEEE Transactions on Reliability, vol. 45, no. 3, pp. 443–455, 1996.
14. introduction
related work
tools for injecting faults
tend to be highly specialized
e.g., programming language, execution environments, APIs
➜ hard to adapt
tend to employ a fine granularity
what about systems with a lot of components in many layers of abstraction?
➜ inappropriate for complex systems
too many injection points / too complex
15. research question
How can we make software fault injection
for complex distributed systems
more easily accessible and integratable
in software development practices?
16. approach
model-based
capture users’ expectations in a comprehensible format
full automation
campaign derivation and exercise
campaign := all fault injection experiments for specific assessment
case study
on OpenStack
recent, complex, distributed, promises high-availability
17. approach
[diagram: system under test, workload, measurements, model, automation]
18. approach
[diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
19. dependability modeling
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
20. dependability modeling
models
to capture users’ expectations
comprehensible by non-developers
commonly used to cope with complexity
usually formal
machine-readable
fosters representativeness
because it incorporates users’ experience
e.g., structure of the system under test, granularity, fault types
21. dependability modeling
fault tree diagrams
for root cause analyses
top-down, deductive
used up-front or for accident investigation
Boolean logic
extensions for timing, dependencies, etc.
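The Boolean logic of a fault tree can be illustrated with a minimal Python sketch. It is not the tool presented here: the classes `BasicEvent` and `Gate`, the function `occurs`, and the example tree are hypothetical, and only plain AND/OR gates are covered (no extensions for timing or dependencies).

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union


@dataclass(frozen=True)
class BasicEvent:
    """A leaf of the fault tree, e.g., the crash of one node."""
    name: str


@dataclass(frozen=True)
class Gate:
    """An inner node: 'AND' occurs if all children occur, 'OR' if any child occurs."""
    kind: str  # "AND" or "OR"
    children: Tuple["Node", ...]


Node = Union[BasicEvent, Gate]


def occurs(node: Node, active_faults: FrozenSet[str]) -> bool:
    """Return True if the event represented by `node` occurs under the given faults."""
    if isinstance(node, BasicEvent):
        return node.name in active_faults
    results = [occurs(child, active_faults) for child in node.children]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical example: the service fails if both controllers crash
# or if the single storage node crashes.
top_event = Gate("OR", (
    Gate("AND", (BasicEvent("crash controller-1"),
                 BasicEvent("crash controller-2"))),
    BasicEvent("crash storage-1"),
))

for faults in ({"crash controller-1"}, {"crash storage-1"}):
    print(sorted(faults), "->", occurs(top_event, frozenset(faults)))
```

Evaluating the top event for a given set of active faults answers whether that combination of faults is expected to be tolerated.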
24. fault tree diagrams
online editor available
25. dependability modeling
granularity
Where to inject?
system under test
e.g., source code
interfaces of system under test
e.g., APIs
interfaces built on
e.g., other services, host OS
hardware
e.g., network, data, performance, clock drift
➜ need clear scope to cope with complexity
26. dependability modeling
granularity
for this assessment: nodes are black boxes
node := physical or virtual machines the system is composed of
i.e., nodes seen as “atoms” of system
➜ nodes are targets for fault injection
27. dependability modeling
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
28. OpenStack
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
29. case study
assessing the dependability of OpenStack
30. OpenStack
framework for building IaaS platforms
providing, e.g., VMs, storage, networking to its users
free, open-source software
Apache License 2.0
emerged from projects at Rackspace and NASA
i.e., from Cloud Files and Nebula
composed of services
communication to/between them mostly via RESTful APIs over HTTP
31. OpenStack
services
dashboard: Horizon
auth*: Keystone
virtual machine image: Glance
object storage: Swift
block storage: Cinder
network: Neutron
compute: Nova
[diagram: relations between the services, labeled “provides images”, “might use”, “provides block storage”, “provides network”, “manages”, “stores backups”, “stores images”, “provides UI”, “provides auth”, “VNC access”]
32. OpenStack Fuel
OpenStack distribution
UIs to manage OpenStack instances
capable of installing a high availability setup
high availability :≈ as commonly used
(i.e., higher dependability than common comparable systems)
33. OpenStack Fuel
concept
[diagram: the Fuel master sends OS images via PXE for bootstrapping; nodes are shown as unconfigured, getting configured, or configured OpenStack nodes]
34. OpenStack Fuel
OpenStack distribution
UIs to manage OpenStack instances
capable of installing a high availability setup
high availability :≈ as commonly used
(i.e., higher dependability than common comparable systems)
distributes services according to node roles
35. OpenStack Fuel
node roles
[diagram: the OpenStack services (Horizon, Keystone, Glance, Cinder, Neutron, Nova, Swift) and their relations, grouped by the node roles storage, controller, and compute]
36. OpenStack Fuel
nodes and networks
nodes: 3 × controller, 3 × storage, 2 × compute, 1 × master
networks: mgmt, storage, public, private, PXE, internet
37. OpenStack Fuel
deployment attempt 1
1 × Intel Xeon E3-1284L v4, 4C/8T, 2.90GHz
32GB, 4 × 8GB PC3L-12800 (DDR3–1600) SODIMM
120GB HP 765479-B21 SSD (M.2 2280)
Ubuntu 16.04, Linux 4.4.0
main insights:
A. unusably slow
B. unstable
a. too little RAM
38. OpenStack Fuel
deployment attempt 2
2 × Intel Xeon E5-2630, 6C/12T, 2.30GHz
384GB, 24 × 16GB PC3-12800 (DDR3–1600) DIMM
1TB RAID-1, 2 × 1TB 7,200 RPM, 64MB Cache
Debian 8.6, Linux 4.7.0
main insights:
A. very hard (but possible) to get sufficient performance (esp. IO) with nested virtualization
B. snapshot and restore of running OpenStack via outer VM does not work
39. OpenStack Fuel
deployment attempt 2: virtualization IO settings
40. OpenStack Fuel
deployment attempt 3
2 × Intel Xeon E5-2630, 6C/12T, 2.30GHz
384GB, 24 × 16GB PC3-12800 (DDR3–1600) DIMM
1TB RAID-1, 2 × 1TB 7,200 RPM, 64MB Cache
Debian 8.6, Linux 4.7.0
main insights:
A. usable performance
B. snapshot and restore of running OpenStack via inner VMs does not work
a. VMs need shutdown and reboot
Unfortunately, the physical host became unavailable during the assessment.
41. OpenStack Fuel
deployment used for assessments
each physical host:
1 × Intel Xeon E3-1284L v4, 4C/8T, 2.90GHz
32GB, 4 × 8GB PC3L-12800 (DDR3–1600) SODIMM
120GB HP 765479-B21 SSD (M.2 2280)
Mellanox Connect-X3 Pro (Dual 10GbE, one used)
Ubuntu 16.04, Linux 4.4.0
42. OpenStack Fuel
deployment used for assessments
virtualized
easy programmatic machine crashes, freezes, etc.
virtualized hardware
i.e., full virtualization
enables programmatic injection in virtual hardware
virtualized networks
easy programmatic injection in network traffic
9 nodes distributed over 5 physical hosts
after a lot of experimentation with hardware & virtualization
43. example workload
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
44. example workload
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
45. example workload
Tahoe-LAFS
Least-Authority File Store
distributed file store
free, open-source, peer-to-peer
uses three main IaaS components
storage, compute, network
automatable through APIs
performance counters for analysis
46. example workload
Tahoe-LAFS grid
[diagram: introducer; interfaces CLI, sftp, REST, …; three nodes; one trusted node]
47. example workload
storage stack
48. example workload
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
49. fault tree modeling
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
57. fault tree modeling
alternative input: fault tree in dot format
58. fault tree modeling
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
59. campaign derivation
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
60. campaign derivation
derive campaign from dependability model
representative
i.e., injected faults are realistic & relevant
full coverage
i.e., test all modeled fault tolerance mechanisms
repeatable
i.e., ability to repeat exercising a campaign
efficient
i.e., a reasonable cost-benefit ratio
61. campaign derivation
dependability stress
F: all fault injection points
≙ all basic events
S ⊆ F: scenario
≙ a set of faults to inject
A := ℘(F): all possible scenarios
≙ all possible subsets of F
C ⊆ A: campaign
≙ sets of faults to inject
S ∊ C ⇔ ?
62. campaign derivation
dependability stress
exercising all scenarios is not feasible
|℘(F)| = 2^|F|
➜ exclude certain scenarios
scenarios which are expected to fail
re-using well-known algorithm for fault trees (MOCUS)
scenarios which are included in others
e.g., exclude {crash A} if there is {crash A, crash B}
i.e., “maximization” of scenarios
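The two exclusion steps can be sketched in Python as follows. This is a brute-force illustration that enumerates the whole power set, so it only works for small models; it is not the MOCUS-based derivation mentioned above. The function names, the fault names, and the predicate `expected_to_fail` are hypothetical.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List

Scenario = FrozenSet[str]


def all_scenarios(faults: Iterable[str]) -> List[Scenario]:
    """Enumerate the power set of the fault injection points (2^|F| scenarios)."""
    ordered = sorted(faults)  # sorting keeps the derivation deterministic
    return [frozenset(combination)
            for size in range(len(ordered) + 1)
            for combination in combinations(ordered, size)]


def derive_campaign(faults: Iterable[str],
                    expected_to_fail: Callable[[Scenario], bool]) -> List[Scenario]:
    """Keep the scenarios that are expected to be tolerated and are maximal."""
    tolerable = [s for s in all_scenarios(faults) if not expected_to_fail(s)]
    # A scenario is non-maximal if it is a proper subset of another tolerable one.
    maximal = [s for s in tolerable if not any(s < other for other in tolerable)]
    return sorted(maximal, key=sorted)


# Hypothetical model: the system fails if both controllers or both storage nodes crash.
FAULTS = ["crash c1", "crash c2", "crash s1", "crash s2"]


def expected_to_fail(scenario: Scenario) -> bool:
    return ({"crash c1", "crash c2"} <= scenario
            or {"crash s1", "crash s2"} <= scenario)


campaign = derive_campaign(FAULTS, expected_to_fail)
for scenario in campaign:
    print(sorted(scenario))
```

In this hypothetical model, 16 possible scenarios shrink to 4, each injecting one controller crash and one storage crash at the same time.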
63. campaign derivation
dependability stress
[diagram: all |℘(F)| = 2^|F| scenarios split into scenarios expected to fail and scenarios expected to succeed; among the latter, non-maximal scenarios are excluded and the rest form the campaign]
64. campaign derivation
dependability stress
maximize “dependability stress”
i.e., inject as many faults simultaneously as tolerable
tests for synergistic failures
i.e., tests for interferences between fault tolerance mechanisms
since fault tolerance mechanisms are active simultaneously
increases efficiency
scenarios are “merged”
repeatable
derivation is deterministic
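One possible way to write down the membership condition for the campaign, using the symbols F, S, A, and C from the definitions above. The predicate tolerated(S), meaning that the dependability model expects the system to deliver its service while all faults in S are active, is a name introduced here for illustration only; the symbol \subsetneq requires the amssymb package.

```latex
\[
  C := \{\, S \in A \;\mid\; \mathrm{tolerated}(S)
        \,\wedge\, \neg\exists\, S' \in A :\;
        S \subsetneq S' \wedge \mathrm{tolerated}(S') \,\}
\]
```

Read informally: a scenario belongs to the campaign if it is expected to be tolerated and no larger tolerated scenario contains it.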
65. case study
campaign derivation
all scenarios: 2048
would take ~333h to exercise
maximal scenarios: 36
takes ~6h to exercise
…with 3 fault types & 10 runs per scenario:
all scenarios: 61440 runs, ~416d to exercise
maximal scenarios: 1080 runs, ~7.3d to exercise
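The figures above can be cross-checked with a short calculation. The sketch assumes that every run takes the same time and infers two values the slides do not state: 11 fault injection points (because 2048 = 2^11) and roughly 10 minutes per run (~333h / 2048).

```python
FAULT_POINTS = 11                 # assumption: inferred from 2048 = 2**11
ALL_SCENARIOS = 2 ** FAULT_POINTS
MAXIMAL_SCENARIOS = 36            # from the slide
HOURS_FOR_ALL = 333               # from the slide (approximate)

# Assumption: every run takes the same time, derived from the two figures above.
HOURS_PER_RUN = HOURS_FOR_ALL / ALL_SCENARIOS


def runs(scenarios: int, fault_types: int = 1, repetitions: int = 1) -> int:
    """Number of runs when each scenario is exercised per fault type and repetition."""
    return scenarios * fault_types * repetitions


def duration_hours(number_of_runs: int) -> float:
    """Estimated wall-clock time for a number of runs, in hours."""
    return number_of_runs * HOURS_PER_RUN


for label, scenarios in (("all", ALL_SCENARIOS), ("maximal", MAXIMAL_SCENARIOS)):
    single = runs(scenarios)
    full = runs(scenarios, fault_types=3, repetitions=10)
    print(f"{label}: {single} runs, about {duration_hours(single):.1f} h; "
          f"{full} runs, about {duration_hours(full) / 24:.1f} d")
```

Under these assumptions the estimates agree with the slide: about 6h for the 36 maximal scenarios, and about 416d versus 7.3d once 3 fault types and 10 runs per scenario are included.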
66. campaign derivation
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
67. exercise of the campaign
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
68. exercise of the campaign
via user-provided implementation
executables in a specific directory structure
get state as CLI parameters
any language of preference
reuse existing tools
low level of integration
easily exchangeable
well-known pattern from UNIX-like OS
e.g., “hooks”, run-parts utility
69. exercise of the campaign
directories for executables
1 ./pre-campaign
e.g., take snapshot of system, prepare logging
2-1 ./pre-scenario
e.g., restore snapshot, measure performance
2-2 ./event
invoked per fault to inject, e.g., power off a VM
2-3 ./post-scenario
e.g., measure performance
3 ./post-campaign
e.g., restore snapshot, trigger data analysis
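The order in which the directories are visited can be sketched in Python. Only the directory names are taken from the list above. How the actual tool passes state is not shown on the slides, so the CLI parameters used here (scenario number, fault name) are assumptions, as are all function names.

```python
import os
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List, Sequence


def run_checked(command: List[str]) -> None:
    """Run one executable and raise if it exits with a non-zero status."""
    subprocess.run(command, check=True)


def executables_in(directory: Path) -> List[Path]:
    """Return the executable files of a directory in lexical order (like run-parts)."""
    if not directory.is_dir():
        return []
    return sorted(path for path in directory.iterdir()
                  if path.is_file() and os.access(path, os.X_OK))


def run_hooks(directory: Path, args: Sequence[str],
              run: Callable[[List[str]], None]) -> None:
    """Invoke every executable in `directory` with the given CLI parameters."""
    for executable in executables_in(directory):
        run([str(executable), *args])


def exercise(base: Path, campaign: Iterable[Sequence[str]],
             run: Callable[[List[str]], None] = run_checked) -> None:
    """Exercise all scenarios of a campaign via the hook directories under `base`."""
    run_hooks(base / "pre-campaign", [], run)
    for number, scenario in enumerate(campaign):
        state = [str(number)]  # assumption: scenario number passed as CLI parameter
        run_hooks(base / "pre-scenario", state, run)
        for fault in scenario:
            # assumption: the fault to inject is passed as an additional parameter
            run_hooks(base / "event", state + [fault], run)
        run_hooks(base / "post-scenario", state, run)
    run_hooks(base / "post-campaign", [], run)
```

A call such as `exercise(Path("./hooks"), [["crash a", "crash b"]])` would run the executables of `./hooks/pre-campaign` once, then those of the three scenario-level directories for the single scenario, and finally those of `./hooks/post-campaign`; the path `./hooks` is a placeholder.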
70. exercise of the campaign
workers
71. exercise of the campaign
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
72. results
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
73. results
measurements
perspective of OpenStack users
one specific use case
upload 100MB of random binary data to Tahoe-LAFS grid
before, during, after faults are active
collect performance counters from Tahoe-LAFS
analysis based on performance degradation
“how many times it took longer in the presence of faults”
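The degradation measure can be written as a ratio of durations. The sketch below is illustrative only: the function name and all durations are hypothetical and are not measurements from the case study.

```python
from statistics import mean
from typing import List


def degradation_factor(baseline_seconds: float, faulty_seconds: float) -> float:
    """How many times longer an operation took in the presence of faults."""
    if baseline_seconds <= 0:
        raise ValueError("baseline duration must be positive")
    return faulty_seconds / baseline_seconds


# Hypothetical upload durations in seconds (not measured values).
baseline = 40.0
with_faults: List[float] = [44.0, 72.0, 160.0]

factors = [degradation_factor(baseline, duration) for duration in with_faults]
print("factors:", [round(factor, 2) for factor in factors])
print("average:", round(mean(factors), 2))
```

A factor of 1 means no measurable degradation; a factor of 4 means the upload took four times as long as the baseline.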
74. results
75. results
degradation after activating stop faults
[chart: degradation relative to the baseline]
76. results
degradation after deactivating stop faults
[chart: degradation relative to the baseline]
77. results
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
78. conclusion
[overview diagram: system under test, workload, measurements; campaign with model, derivation, exercise]
79. conclusion
challenges
campaign derivation
with non-Boolean elements in the dependability model
complex distributed systems
getting a fully virtualized OpenStack setup
with performant nested virtualization
restoring from snapshots
after restore: boot all nodes & wait for OpenStack to be operable
automation
absent APIs, long run times (hard to debug), sporadic failures
80. conclusion
contributions
[previous slide] +
framework for software fault injection
model-based, automated, repeatable, flexible
dependability assessment of OpenStack Fuel
reusable implementation for exercising the campaign
baseline performance fluctuation
two fault types
crash, stop
81. ? !
Lukas Pirl | slideshare+sfi-os@lukas-pirl.de | http://lukas-pirl.de
supervision: Lena Feinbube, Prof. Dr. Andreas Polze
Operating Systems and Middleware Group
Hasso Plattner Institute for Digital Engineering
University of Potsdam
June 2017