Cloud Resilience
Fault Injection for Increased Resilience
Jorge Cardoso
(jorge.cardoso@huawei.com)
Huawei European Research Center
Riesstraße 25, 80992 München
The Butterfly Effect Project
OpenStack Munich - Cloud Resilience &
Experiences with OpenStack
Wednesday, April 13, 2016
6:30 PM
FusionSphere from Huawei
#6
News from OpenStack
06 April 2016
FAILURES ARE INEVITABLE!
THE BEST WE CAN DO IS BE
PREPARED FOR THEM AND LEARN
FROM THEM
TEST, REPAIR, LEARN & PREDICT!
Unplanned downtime is caused by*:
• software bugs … 27%
• hardware … 23%
• human error … 18%
• network failures … 17%
• natural disasters … 8%
*Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.
Google's 2007 study found annualized failure rates (AFRs) for disk drives:
• 1-year-old drives: 1.7%
• 3-year-old drives: >8.6%
Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proceedings of
the 5th USENIX conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 2-2.
Why does using a cloud infrastructure require advanced approaches to resiliency?
One reason [Netflix]: it's the lack of control over the underlying hardware, the inability to configure it to try to ensure 100% uptime.
Technology Trends
Google Trends chart comparing search interest in "cloud availability" and "cloud failure".
Netflix: Chaos Monkey
• Chaos Monkey: randomly terminates instances in a cluster
• Chaos Gorilla: simulates an Availability Zone becoming unavailable
• Chaos Kong: simulates an entire region outage
• Latency Monkey: introduces latency into network packets to simulate degradation of the EC2 network
• Janitor Monkey: cleans up unused resources
• Security Monkey: analyzes and notifies on security profile changes
AWS recently recommended that firms using its infrastructure test their resilience by using Chaos Monkey to induce failures.
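To make the idea concrete, here is a minimal sketch of a Chaos-Monkey-style experiment against an OpenStack cluster (not Netflix's implementation, which targets AWS). The credentials, endpoint, and name prefix are assumptions for illustration; the client construction uses the older python-novaclient interface, while newer deployments would use a keystoneauth session.

import random
from novaclient import client as nova_client

# Illustrative credentials only; replace with a proper keystoneauth session in real use.
nova = nova_client.Client("2", "demo", "secret", "demo", "http://controller:5000/v2.0")

def chaos_monkey_once(prefix="app-"):
    """Randomly terminate one ACTIVE instance whose name starts with `prefix`."""
    candidates = [s for s in nova.servers.list()
                  if s.name.startswith(prefix) and s.status == "ACTIVE"]
    if not candidates:
        return None
    victim = random.choice(candidates)
    victim.delete()   # the cluster behind `prefix` must survive losing this instance
    return victim.name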
Netflix: Chaos Monkey
• April 29, 2011: Amazon EC2 and Amazon RDS service disruption in the US East Region
• September 20, 2015: Amazon's DynamoDB service experienced an availability issue in US-EAST-1
(Slide annotations: "Fewer alerts for ops team"; "Transfer traffic to east region".)
Amazon AWS: GameDay
A program designed to increase resilience by purposely injecting major failures, in order to discover flaws and subtle dependencies.
“That seems totally bizarre on the face of it, but as you dig down, you end up finding some dependency no one knew about previously […] We’ve had situations where we brought down a network in, say, São Paulo, only to find that in doing so we broke our links in Mexico.”
Google: DiRT (Disaster Recovery Test)
• Annual disaster recovery and testing exercise; 8 years since inception
• Multi-day exercise triggering (controlled) failures in systems and processes
• Premise: 30-day incapacitation of headquarters following a disaster; other offices and facilities may be affected
• When: “big disaster” annually for 3-5 days; continuous testing year-round
• Who: 100s of engineers (Site Reliability, Network, Hardware, Software, Security, Facilities) and business units (Human Resources, Finance, Safety, Crisis Response, etc.)
http://flowcon.org/dl/flowcon-sanfran-2014/slides/KripaKrishnan_LearningContinuouslyFromFailures.pdf
Goal
-- Butterfly Effect System --
Enables automatic testing and repair of OpenStack and cloud applications (target stack: CLOUD APPLICATION on top of HUAWEI FusionSphere).
The system works by intentionally injecting different failures, testing the ability to survive them, and learning how to predict and repair failures preemptively.
Cycle: Failure → Test → Repair
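As an illustration of the Failure → Test → Repair cycle only (not the actual Butterfly Effect code), a minimal sketch with hypothetical inject_fault, run_tests, and repair helpers:

import random

def butterfly_round(fault_library, inject_fault, run_tests, repair, knowledge_base):
    """One round of the Failure -> Test -> Repair cycle (illustrative sketch only)."""
    fault = random.choice(fault_library)       # e.g. "kill cinder database"
    inject_fault(fault)                        # Failure: purposely break something
    damages = run_tests()                      # Test: e.g. run Tempest, collect failures
    for damage in damages:
        repair(damage)                         # Repair: apply a recovery action
    knowledge_base.append((fault, damages))    # Learn: fault -> damage patterns for prediction
    return damages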
Use Case: OpenStack Resiliency
Best way to avoid failure: fail constantly. Example fault injections:
• Kill the Cinder database (simulate an update failure)
• Introduce delay into messages (full-scale traffic shows where the real bottlenecks are)
• Operation error: OPENSTACK_KEYSTONE_URL = "http://%s:5000/v2.0" % OPENSTACK_HOST
• Operation error: in /etc/nova/nova.conf, delete auth_strategy=keystone
• Remove the driver to the HD; remove access to NFS (simulate a hardware failure)
The main testing framework of OpenStack is Tempest, an open-source project with more than 2000 tests; it does only black-box testing (tests only access the public interfaces).
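A minimal sketch of how two of these faults could be injected from a script running on the controller node; the service name (MariaDB backing Cinder) and the network interface are assumptions for illustration, not the project's actual tooling.

import subprocess

def kill_cinder_database():
    """Simulate an update failure by stopping the database behind Cinder (assumes MariaDB)."""
    subprocess.check_call(["systemctl", "stop", "mariadb"])

def delay_messages(interface="eth0", delay_ms=500):
    """Add latency on the management network with tc/netem to slow down RPC messages."""
    subprocess.check_call(["tc", "qdisc", "add", "dev", interface,
                           "root", "netem", "delay", f"{delay_ms}ms"])

def clear_delay(interface="eth0"):
    """Remove the injected latency again."""
    subprocess.check_call(["tc", "qdisc", "del", "dev", interface, "root", "netem"])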
Use Case 1: Increasing Reliability (Public Cloud)
The Butterfly Effect maps each fault type to the damage pattern it produces, pointing operators to the right remediation: fix configurations, fix bugs, replace hardware, upgrade memory.
Use Case 2: Run Book Automation (RBA) for Public Cloud Incident Management
When an alert arrives (“Is this really an incident?”), the Butterfly Effect matches the fault type and damage pattern against known patterns and triggers either the Major Incident Procedure or an automated recovery script.
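As a sketch of the RBA idea only (the pattern names and runbook paths are invented for illustration), the mapping from fault type and damage pattern to a recovery action could be as simple as:

import subprocess

# Invented pattern names and runbook paths, purely for illustration.
RUNBOOKS = {
    ("database_down", "api_errors"): "/runbooks/restart_cinder_db.sh",
    ("network_delay", "rpc_timeouts"): "/runbooks/restart_rabbitmq.sh",
}

def handle_incident(fault_type, damage_pattern):
    script = RUNBOOKS.get((fault_type, damage_pattern))
    if script is None:
        return "escalate: Major Incident Procedure"   # unknown pattern -> humans decide
    subprocess.check_call([script])                   # known pattern -> automated recovery
    return f"recovered via {script}"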
Components of a Solution
• MONITORING: Nagios, Zabbix, Cacti, StackTach, Synaps, Monasca
• CONFIGURATION AUTOMATION: Ansible, CFEngine, Chef, Puppet, Salt, Heat
• FAULT-INJECTION ENGINES: DestroyStack, FSaaS, ChaosMonkey, AnarchyApe
• FAULT LIBRARIES AND PLANS: pyCallGraph, Intellect, RunDeck, Nose
• DATA VISUALIZATION: Kibana, Graylog2, Grafana
• DAMAGE DETECTION: Tempest, Nose
• DATA STORAGE: ElasticSearch, OpenTSDB, Neo4J, Graphite, Cassandra, Redis
• DATA AGGREGATION: Logstash, Collectd, Flume, Fluentd, Heka, Ceilometer
• MANUAL REPAIR: Bash, Python, Chef, Puppet
• AUTOMATED REPAIR: jCOLIBRI, myCBR, Puppet, Rundeck, (R)?ex, Chef
• DATA PROCESSING: Hadoop, Pig, Hive, Spark, Storm
• OPERATIONS ANALYTICS: Statsd, R, Pandas, Weka, machine learning
• ALERTING: Errbit, Honeybadger, Nagios, Zabbix, OpenPager, Riemann
• DATA SOURCES: log files, Collectd plugins, FlumeNG, OpenStack tables, Zabbix agents, Nagios plugins
• DATA TRANSPORT: rsyslog, ZeroMQ
These components support a seven-step cycle: (1) design & deploy test infrastructure, (2) design & execute fault-injection plan, (3) monitoring facilities, (4) identify damages, (5) repair & learn, (6) predict future errors, (7) automatic repair.
Technological Overview
• (1) Design & Deploy Test Environment
  - Customizable, automated OpenStack deployment
  - FusionServer RH2288 + VirtualBox + Vagrant + RDO
• (2) Design & Execute Fault-Injection Plan
  - Language = Python (no DSL yet); fault engine based on BPM; fault plans follow a workflow paradigm (see the sketch after this list)
• (3) Monitoring Facilities
  - Monasca (from HP, Rackspace, IBM); visualization with Grafana
• (4) Damage Detection
  - OpenStack Tempest: 1200 tests (but only API testing :( )
• (5) Repair & Learn: …
• (6) Predict Future Errors: …
• (7) Automated Repair: …
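Since no DSL exists yet, a fault plan is plain Python organized as a workflow. The following is a hedged sketch: the step functions and the tiny sequential "engine" are placeholders, not the BPM-based engine mentioned above.

# Placeholder step functions; real steps would call the fault-injection and test tooling.
def unmount_disk(ctx):         ctx["disk"] = "unmounted"
def wait_for_replication(ctx): ctx["replicas"] = "regenerated"
def remount_disk(ctx):         ctx["disk"] = "mounted"
def run_tempest_subset(ctx):   ctx["damage_detected"] = False

# A fault plan expressed in the workflow paradigm: an ordered list of steps.
FAULT_PLAN = [unmount_disk, wait_for_replication, remount_disk,
              wait_for_replication, run_tempest_subset]

def execute_plan(plan):
    """Minimal sequential 'engine': run each step in order, sharing a context dict."""
    ctx = {}
    for step in plan:
        step(ctx)
    return ctx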
Deploy Test Environment
• Customizable, automated OpenStack deployment
• FusionServer RH2288 + VirtualBox + Vagrant + RDO
• 2 hours to deploy an OpenStack infrastructure with 32 VMs
Faults to Inject
• Disk temporarily unavailable (see the sketch after this list)
  - unmount a disk
  - wait for replicas to regenerate
  - remount the disk with the data intact
  - wait for replicas to regenerate; the extra replicas on handoff nodes should get removed
• Disk replacement
  - unmount a disk
  - wait for replicas to regenerate
  - delete the disk and remount it
  - wait for replicas to regenerate; the extra replicas on handoff nodes should get removed
• Expected failure
  - damage three disks at the same time (more if the replica count is higher)
  - check that the replicas did not regenerate even after some time period; fail if they regenerated
  - this tests whether the tests themselves are correct
• VM failures
  - send a VM creation request
  - find the compute node where the request was scheduled
  - damage the compute node
  - check whether the VM creation was re-scheduled to another node
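A hedged sketch of the first scenario against a Swift storage node: the SSH helper, device path, and final verification command are assumptions rather than the project's actual code, and remounting by mount point assumes an /etc/fstab entry for the device.

import time
import subprocess

def ssh(host, command):
    """Run a command on a storage node (simplified; real code would handle keys and errors)."""
    return subprocess.check_output(["ssh", host, command]).decode()

def disk_temporarily_unavailable(node, device="/srv/node/sdb1", wait_seconds=600):
    ssh(node, f"umount {device}")                  # 1. unmount a disk on a Swift storage node
    time.sleep(wait_seconds)                       # 2. wait for replicas to regenerate on handoff nodes
    ssh(node, f"mount {device}")                   # 3. remount the disk with the data intact
    time.sleep(wait_seconds)                       # 4. extra replicas on handoff nodes should be removed
    return ssh(node, "swift-recon --replication")  # 5. check replication state (assumed verification)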
Damage Detection
The main testing framework of OpenStack is Tempest, an open-source project with more than 2000 tests; it does only black-box testing (tests only access the public interfaces).
• Network tests: create keypairs, create security groups, create networks
• Compute tests: create a keypair, create a security group, boot an instance
• Swift tests: create a volume, get the volume, delete the volume
• Identity tests: …
• Cinder tests: …
• Glance tests: …
echo "$ tempest init cloud-01"
echo "$ cp tempest/etc/tempest.conf cloud-01/etc/"
echo "$ cd cloud-01"
echo "Next is the full test suite:"
echo "$ ostestr -c 3 --regex '(?!.*[.*bslowb.*])(^tempest.(api|scenario))'"
echo "Next ist the minimum basic test:"
echo "$ ostestr -c 3 --regex '(?!.*[.*bslowb.*])(^tempest.scenario.test_minimum_basic)'"
Zabbix and ELK
Monasca
• Overview: uses the Keystone OpenStack Identity Service for authentication, authorization and multi-tenancy. Monasca integrates with several other OpenStack services such as Heat for auto-scaling and Ceilometer for monitoring OpenStack resources.
• Apache Kafka: a high-throughput distributed messaging system. Kafka is a central component in Monasca and provides the infrastructure for all internal communication between components.
• Apache Storm: a free and open-source distributed realtime computation system, used in the Monasca Threshold Engine.
• InfluxDB: an open-source distributed time series database with no external dependencies; one of the supported databases for storing metrics and alarm history.
• MySQL: one of the supported databases for the Monasca Config Database.
• Grafana: an open-source, feature-rich metrics dashboard and graph editor. Support for Monasca as a data source in Grafana has been added.
• Anomaly Detection: the engine implements real-time streaming anomaly detection with two algorithms, the Numenta Platform for Intelligent Computing (NuPIC) and the Kolmogorov-Smirnov (K-S) two-sample test. Uses StackTach for realtime streaming.
• Performance: 3 HP ProLiant SL390s G7 servers + InfluxDB cluster = 25K-30K metrics/sec; monasca-api > 150K metrics/sec for a 3-node cluster with load balancing; for more performance use the HP Vertica database.
See https://www.openstack.org/assets/presentation-media/Monasca-Deep-Dive-Paris-Summit.pdf
Dashboards: Grafana (compute_instance_create_time), Anomaly Detection (cpu.user_perc)
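For instance, a fault-injection experiment could push its own measurements (injection events, recovery times) into Monasca alongside the infrastructure metrics. A hedged sketch of POSTing one measurement with a Keystone token follows; the endpoint path and payload shape reflect the Monasca v2.0 API as commonly documented, but verify them against your deployment.

import time
import requests

def post_metric(monasca_url, keystone_token, name, value, dimensions=None):
    """Send one measurement to the Monasca metrics API (illustrative sketch)."""
    metric = {
        "name": name,                              # e.g. "fault_injection.events"
        "dimensions": dimensions or {"service": "butterfly-effect"},
        "timestamp": int(time.time() * 1000),      # milliseconds since epoch
        "value": float(value),
    }
    resp = requests.post(f"{monasca_url}/v2.0/metrics",
                         json=metric,
                         headers={"X-Auth-Token": keystone_token})
    resp.raise_for_status()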
Application Domains
Join the Cause!
• Internship positions for MSc students
  - Fault injection, fault models, fault libraries, fault plans, break and rebuild systems all day long, …
• OpenStack Engineer positions
  - Rapid prototyping of cool ideas: propose it today, code it, and show it running in 3 months…
• Innovative PoCs
  - Solving difficult challenges of real problems using quick-and-dirty prototyping
Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive
statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time
without notice.
HUAWEI ENTERPRISE ICT SOLUTIONS A BETTER WAY
