Cloud Resilience
Fault Injection for Increased Resilience
Jorge Cardoso
(jorge.cardoso@huawei.com)
Huawei European Research Center
Riesstraße 25, 80992 München
The Butterfly Effect Project
OpenStack Munich - Cloud Resilience &
Experiences with OpenStack
Wednesday, April 13, 2016
6:30 PM
FusionSphere from Huawei
#6
News from OpenStack
06 April 2016
FAILURES ARE INEVITABLE!
THE BEST WE CAN DO IS BE
PREPARED FOR THEM AND LEARN
FROM THEM
TEST, REPAIR, LEARN & PREDICT!
Unplanned downtime is caused by*:
• software bugs … 27%
• hardware … 23%
• human error … 18%
• network failures … 17%
• natural disasters … 8%
*Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.
Google's 2007 study found annualized failure rates (AFRs) for disk drives:
• 1-year-old drives: 1.7%
• 3-year-old drives: >8.6%
Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proceedings of
the 5th USENIX conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 2-2.
Why does using a cloud infrastructure require advanced approaches to resiliency?
One reason [Netflix]: it's the lack of control over the underlying hardware, the inability to configure it to try to ensure 100% uptime.
Technology Trends
Google Trends chart comparing search interest in "cloud availability" and "cloud failure".
Netflix: Chaos Monkey
• Chaos Monkey: randomly terminates instances in a cluster
• Chaos Gorilla: simulates an Availability Zone becoming unavailable
• Chaos Kong: simulates an entire region outage
• Latency Monkey: introduces latency into network packets to simulate degradation of the EC2 network
• Janitor Monkey: cleans up unused resources
• Security Monkey: analyzes and notifies on security profile changes
AWS recently recommended that firms using its infrastructure test their resilience by using Chaos Monkey to induce failures.
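To make the idea concrete, here is a minimal sketch of a Chaos-Monkey-style experiment against an OpenStack cluster (not Netflix's implementation, which targets AWS). The credentials, endpoint, and name prefix are assumptions for illustration; the client construction uses the older python-novaclient interface, while newer deployments would use a keystoneauth session.

import random
from novaclient import client as nova_client

# Illustrative credentials only; replace with a proper keystoneauth session in real use.
nova = nova_client.Client("2", "demo", "secret", "demo", "http://controller:5000/v2.0")

def chaos_monkey_once(prefix="app-"):
    """Randomly terminate one ACTIVE instance whose name starts with `prefix`."""
    candidates = [s for s in nova.servers.list()
                  if s.name.startswith(prefix) and s.status == "ACTIVE"]
    if not candidates:
        return None
    victim = random.choice(candidates)
    victim.delete()   # the cluster behind `prefix` must survive losing this instance
    return victim.name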
Netflix: Chaos Monkey
• April 29, 2011: Amazon EC2 and Amazon RDS service disruption in the US East Region
• September 20, 2015: Amazon's DynamoDB service experienced an availability issue in US-EAST-1
(Slide annotations: "Fewer alerts for ops team"; "Transfer traffic to east region".)
Amazon AWS: GameDay
A program designed to increase resilience by purposely injecting major failures, in order to discover flaws and subtle dependencies.
“That seems totally bizarre on the face of it, but as you dig down, you end up finding some dependency no one knew about previously […] We’ve had situations where we brought down a network in, say, São Paulo, only to find that in doing so we broke our links in Mexico.”
Google: DiRT (Disaster Recovery Test)
• Annual disaster recovery and testing exercise; 8 years since inception
• Multi-day exercise triggering (controlled) failures in systems and processes
• Premise: 30-day incapacitation of headquarters following a disaster; other offices and facilities may be affected
• When: “big disaster” annually for 3-5 days; continuous testing year-round
• Who: 100s of engineers (Site Reliability, Network, Hardware, Software, Security, Facilities) and business units (Human Resources, Finance, Safety, Crisis Response, etc.)
http://flowcon.org/dl/flowcon-sanfran-2014/slides/KripaKrishnan_LearningContinuouslyFromFailures.pdf
Goal
-- Butterfly Effect System --
Enables automatic testing and repair of OpenStack and cloud applications (target stack: CLOUD APPLICATION on top of HUAWEI FusionSphere).
The system works by intentionally injecting different failures, testing the ability to survive them, and learning how to predict and repair failures preemptively.
Cycle: Failure → Test → Repair
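As an illustration of the Failure → Test → Repair cycle only (not the actual Butterfly Effect code), a minimal sketch with hypothetical inject_fault, run_tests, and repair helpers:

import random

def butterfly_round(fault_library, inject_fault, run_tests, repair, knowledge_base):
    """One round of the Failure -> Test -> Repair cycle (illustrative sketch only)."""
    fault = random.choice(fault_library)       # e.g. "kill cinder database"
    inject_fault(fault)                        # Failure: purposely break something
    damages = run_tests()                      # Test: e.g. run Tempest, collect failures
    for damage in damages:
        repair(damage)                         # Repair: apply a recovery action
    knowledge_base.append((fault, damages))    # Learn: fault -> damage patterns for prediction
    return damages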
Use Case: OpenStack Resiliency
Best way to avoid failure: fail constantly. Example fault injections:
• Kill the Cinder database (simulate an update failure)
• Introduce delay into messages (full-scale traffic shows where the real bottlenecks are)
• Operation error: OPENSTACK_KEYSTONE_URL = "http://%s:5000/v2.0" % OPENSTACK_HOST
• Operation error: in /etc/nova/nova.conf, delete auth_strategy=keystone
• Remove the driver to the HD; remove access to NFS (simulate a hardware failure)
The main testing framework of OpenStack is Tempest, an open-source project with more than 2000 tests; it does only black-box testing (tests only access the public interfaces).
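A minimal sketch of how two of these faults could be injected from a script running on the controller node; the service name (MariaDB backing Cinder) and the network interface are assumptions for illustration, not the project's actual tooling.

import subprocess

def kill_cinder_database():
    """Simulate an update failure by stopping the database behind Cinder (assumes MariaDB)."""
    subprocess.check_call(["systemctl", "stop", "mariadb"])

def delay_messages(interface="eth0", delay_ms=500):
    """Add latency on the management network with tc/netem to slow down RPC messages."""
    subprocess.check_call(["tc", "qdisc", "add", "dev", interface,
                           "root", "netem", "delay", f"{delay_ms}ms"])

def clear_delay(interface="eth0"):
    """Remove the injected latency again."""
    subprocess.check_call(["tc", "qdisc", "del", "dev", interface, "root", "netem"])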
Use Case 1: Increasing Reliability (Public Cloud)
The Butterfly Effect maps each fault type to the damage pattern it produces, pointing operators to the right remediation: fix configurations, fix bugs, replace hardware, upgrade memory.
Use Case 2: Run Book Automation (RBA) for Public Cloud Incident Management
When an alert arrives (“Is this really an incident?”), the Butterfly Effect matches the fault type and damage pattern against known patterns and triggers either the Major Incident Procedure or an automated recovery script.
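As a sketch of the RBA idea only (the pattern names and runbook paths are invented for illustration), the mapping from fault type and damage pattern to a recovery action could be as simple as:

import subprocess

# Invented pattern names and runbook paths, purely for illustration.
RUNBOOKS = {
    ("database_down", "api_errors"): "/runbooks/restart_cinder_db.sh",
    ("network_delay", "rpc_timeouts"): "/runbooks/restart_rabbitmq.sh",
}

def handle_incident(fault_type, damage_pattern):
    script = RUNBOOKS.get((fault_type, damage_pattern))
    if script is None:
        return "escalate: Major Incident Procedure"   # unknown pattern -> humans decide
    subprocess.check_call([script])                   # known pattern -> automated recovery
    return f"recovered via {script}"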
Components of a Solution
• MONITORING: Nagios, Zabbix, Cacti, StackTach, Synaps, Monasca
• CONFIGURATION AUTOMATION: Ansible, CFEngine, Chef, Puppet, Salt, Heat
• FAULT-INJECTION ENGINES: DestroyStack, FSaaS, ChaosMonkey, AnarchyApe
• FAULT LIBRARIES AND PLANS: pyCallGraph, Intellect, RunDeck, Nose
• DATA VISUALIZATION: Kibana, Graylog2, Grafana
• DAMAGE DETECTION: Tempest, Nose
• DATA STORAGE: ElasticSearch, OpenTSDB, Neo4J, Graphite, Cassandra, Redis
• DATA AGGREGATION: Logstash, Collectd, Flume, Fluentd, Heka, Ceilometer
• MANUAL REPAIR: Bash, Python, Chef, Puppet
• AUTOMATED REPAIR: jCOLIBRI, myCBR, Puppet, Rundeck, (R)?ex, Chef
• DATA PROCESSING: Hadoop, Pig, Hive, Spark, Storm
• OPERATIONS ANALYTICS: Statsd, R, Pandas, Weka, machine learning
• ALERTING: Errbit, Honeybadger, Nagios, Zabbix, OpenPager, Riemann
• DATA SOURCES: log files, Collectd plugins, FlumeNG, OpenStack tables, Zabbix agents, Nagios plugins
• DATA TRANSPORT: rsyslog, ZeroMQ
These components support a seven-step cycle: (1) design & deploy test infrastructure, (2) design & execute fault-injection plan, (3) monitoring facilities, (4) identify damages, (5) repair & learn, (6) predict future errors, (7) automatic repair.
Technological Overview
• (1) Design & Deploy Test Environment
  - Customizable, automated OpenStack deployment
  - FusionServer RH2288 + VirtualBox + Vagrant + RDO
• (2) Design & Execute Fault-Injection Plan
  - Language = Python (no DSL yet); fault engine based on BPM; fault plans follow a workflow paradigm (see the sketch after this list)
• (3) Monitoring Facilities
  - Monasca (from HP, Rackspace, IBM); visualization with Grafana
• (4) Damage Detection
  - OpenStack Tempest: 1200 tests (but only API testing :( )
• (5) Repair & Learn: …
• (6) Predict Future Errors: …
• (7) Automated Repair: …
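Since no DSL exists yet, a fault plan is plain Python organized as a workflow. The following is a hedged sketch: the step functions and the tiny sequential "engine" are placeholders, not the BPM-based engine mentioned above.

# Placeholder step functions; real steps would call the fault-injection and test tooling.
def unmount_disk(ctx):         ctx["disk"] = "unmounted"
def wait_for_replication(ctx): ctx["replicas"] = "regenerated"
def remount_disk(ctx):         ctx["disk"] = "mounted"
def run_tempest_subset(ctx):   ctx["damage_detected"] = False

# A fault plan expressed in the workflow paradigm: an ordered list of steps.
FAULT_PLAN = [unmount_disk, wait_for_replication, remount_disk,
              wait_for_replication, run_tempest_subset]

def execute_plan(plan):
    """Minimal sequential 'engine': run each step in order, sharing a context dict."""
    ctx = {}
    for step in plan:
        step(ctx)
    return ctx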
Deploy Test Environment
• Customizable, automated OpenStack deployment
• FusionServer RH2288 + VirtualBox + Vagrant + RDO
• 2 hours to deploy an OpenStack infrastructure with 32 VMs
Faults to Inject
• Disk temporarily unavailable (see the sketch after this list)
  - unmount a disk
  - wait for replicas to regenerate
  - remount the disk with the data intact
  - wait for replicas to regenerate; the extra replicas on handoff nodes should get removed
• Disk replacement
  - unmount a disk
  - wait for replicas to regenerate
  - delete the disk and remount it
  - wait for replicas to regenerate; the extra replicas on handoff nodes should get removed
• Expected failure
  - damage three disks at the same time (more if the replica count is higher)
  - check that the replicas did not regenerate even after some time period; fail if they regenerated
  - this tests whether the tests themselves are correct
• VM failures
  - send a VM creation request
  - find the compute node where the request was scheduled
  - damage the compute node
  - check whether the VM creation was re-scheduled to another node
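A hedged sketch of the first scenario against a Swift storage node: the SSH helper, device path, and final verification command are assumptions rather than the project's actual code, and remounting by mount point assumes an /etc/fstab entry for the device.

import time
import subprocess

def ssh(host, command):
    """Run a command on a storage node (simplified; real code would handle keys and errors)."""
    return subprocess.check_output(["ssh", host, command]).decode()

def disk_temporarily_unavailable(node, device="/srv/node/sdb1", wait_seconds=600):
    ssh(node, f"umount {device}")                  # 1. unmount a disk on a Swift storage node
    time.sleep(wait_seconds)                       # 2. wait for replicas to regenerate on handoff nodes
    ssh(node, f"mount {device}")                   # 3. remount the disk with the data intact
    time.sleep(wait_seconds)                       # 4. extra replicas on handoff nodes should be removed
    return ssh(node, "swift-recon --replication")  # 5. check replication state (assumed verification)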
Damage Detection
The main testing framework of OpenStack is Tempest, an open-source project with more than 2000 tests; it does only black-box testing (tests only access the public interfaces).
• Network tests: create keypairs, create security groups, create networks
• Compute tests: create a keypair, create a security group, boot an instance
• Swift tests: create a volume, get the volume, delete the volume
• Identity tests: …
• Cinder tests: …
• Glance tests: …
echo "$ tempest init cloud-01"
echo "$ cp tempest/etc/tempest.conf cloud-01/etc/"
echo "$ cd cloud-01"
echo "Next is the full test suite:"
echo "$ ostestr -c 3 --regex '(?!.*[.*bslowb.*])(^tempest.(api|scenario))'"
echo "Next ist the minimum basic test:"
echo "$ ostestr -c 3 --regex '(?!.*[.*bslowb.*])(^tempest.scenario.test_minimum_basic)'"
Zabbix and ELK
Monasca
• Overview: uses the Keystone OpenStack Identity Service for authentication, authorization and multi-tenancy. Monasca integrates with several other OpenStack services such as Heat for auto-scaling and Ceilometer for monitoring OpenStack resources.
• Apache Kafka: a high-throughput distributed messaging system. Kafka is a central component in Monasca and provides the infrastructure for all internal communication between components.
• Apache Storm: a free and open-source distributed realtime computation system, used in the Monasca Threshold Engine.
• InfluxDB: an open-source distributed time series database with no external dependencies; one of the supported databases for storing metrics and alarm history.
• MySQL: one of the supported databases for the Monasca Config Database.
• Grafana: an open-source, feature-rich metrics dashboard and graph editor. Support for Monasca as a data source in Grafana has been added.
• Anomaly Detection: the engine implements real-time streaming anomaly detection with two algorithms, the Numenta Platform for Intelligent Computing (NuPIC) and the Kolmogorov-Smirnov (K-S) two-sample test. Uses StackTach for realtime streaming.
• Performance: 3 HP ProLiant SL390s G7 servers + InfluxDB cluster = 25K-30K metrics/sec; monasca-api > 150K metrics/sec for a 3-node cluster with load balancing; for more performance use the HP Vertica database.
See https://www.openstack.org/assets/presentation-media/Monasca-Deep-Dive-Paris-Summit.pdf
Dashboards: Grafana (compute_instance_create_time), Anomaly Detection (cpu.user_perc)
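For instance, a fault-injection experiment could push its own measurements (injection events, recovery times) into Monasca alongside the infrastructure metrics. A hedged sketch of POSTing one measurement with a Keystone token follows; the endpoint path and payload shape reflect the Monasca v2.0 API as commonly documented, but verify them against your deployment.

import time
import requests

def post_metric(monasca_url, keystone_token, name, value, dimensions=None):
    """Send one measurement to the Monasca metrics API (illustrative sketch)."""
    metric = {
        "name": name,                              # e.g. "fault_injection.events"
        "dimensions": dimensions or {"service": "butterfly-effect"},
        "timestamp": int(time.time() * 1000),      # milliseconds since epoch
        "value": float(value),
    }
    resp = requests.post(f"{monasca_url}/v2.0/metrics",
                         json=metric,
                         headers={"X-Auth-Token": keystone_token})
    resp.raise_for_status()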
Application Domains
Join the Cause!
• Internship positions for MSc students
  - Fault injection, fault models, fault libraries, fault plans, break and rebuild systems all day long, …
• OpenStack Engineer positions
  - Rapid prototyping of cool ideas: propose it today, code it, and show it running in 3 months…
• Innovative PoCs
  - Solving difficult challenges of real problems using quick-and-dirty prototyping
Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive
statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time
without notice.
HUAWEI ENTERPRISE ICT SOLUTIONS A BETTER WAY
