This document discusses using chaos engineering and controlled experiments to proactively test disaster recovery plans. It recommends starting with small, isolated tests and expanding their scope (the "blast radius") over time. The goals are to build confidence in disaster recovery plans, identify weaknesses, and limit the impact of failures. Regular testing is important because systems change frequently. The document provides examples of chaos engineering experiments and emphasizes documenting, rehearsing, and evaluating disaster recovery plans.
Embracing Chaos Engineering to Validate Disaster Recovery Plans
1. Avoiding Disasters by Embracing Chaos: Validating Disaster Recovery with Chaos Engineering
2. Meet our experts
Sebastian Straub, Principal Solutions Architect, N2WS (sebastian@n2ws.com)
Taylor Smith, Product Marketing Manager, Gremlin (taylor.smith@gremlin.com)
3. Black Friday failures, banks breaking, airline incidents
● "Computer Problems Blamed for Flight Delays" (4.1.19)
● "Citibank Website Down, Not Working" (2.28.19)
● "Technical Issues Likely Cost Retailers Billions" (12.01.16)
4. Availability vs. Rate of Change
[Chart: availability (90% up to 99.9999%) plotted against rate of change (1 to 1,000), labeled "The Reliability Gap".]
"Change introduces new forms of failure that are difficult to see before the fact..."
- Richard Cook, How Complex Systems Fail
7. Engineers proactively test to find and fix issues and limit the impact of failures, and that practice makes them more effective when they must work reactively during an incident.
8. How do you use Chaos Engineering for Disaster Recovery?
[Diagram: Region 1 with AZ 1 and AZ 2, Region 2 with AZ 3 and AZ 4.]
1. Start small and expand the blast radius (a minimal blackhole sketch follows after this slide's key questions):
1) Blackhole the connection to a database node, and see how the application reacts.
2) Shut down a container, pod, or node and check how Kubernetes reacts.
3) Blackhole the connection to an entire Availability Zone.
4) Blackhole the connection to all the instances in an entire region.
2. Fail over from active to passive.
Key questions:
● How did our autoscaling, load balancers & gateways react?
● Do we have enough redundancy in place?
● Did our monitoring & alerting trigger at the right time?
● Was our team able to react and recover fast enough?
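As an illustrative sketch of step 1 (blackholing the database node), the Python snippet below swaps an EC2-hosted database node's security groups for a pre-created group that allows no traffic, polls a health endpoint while the fault is active, and then rolls back. The instance ID, security group ID, and health URL are all assumptions; managed tools such as Gremlin run this kind of blackhole attack for you, so treat this only as a sketch of the idea.

```python
"""
Minimal sketch of a "blackhole the database node" experiment, assuming:
  - boto3 credentials with EC2 permissions,
  - DB_INSTANCE_ID points at the database node (hypothetical ID),
  - BLACKHOLE_SG_ID is a pre-created security group with no inbound or
    outbound rules, so attaching it drops all traffic,
  - APP_HEALTH_URL is a hypothetical application health endpoint.
"""
import time
import urllib.request

import boto3

DB_INSTANCE_ID = "i-0123456789abcdef0"    # assumption: the DB node under test
BLACKHOLE_SG_ID = "sg-0deadbeefcafe0123"  # assumption: an "allow nothing" SG
APP_HEALTH_URL = "https://app.example.com/healthz"  # assumption
DURATION_SECONDS = 120

ec2 = boto3.client("ec2")


def current_security_groups(instance_id):
    """Return the security group IDs currently attached to the instance."""
    attr = ec2.describe_instance_attribute(InstanceId=instance_id, Attribute="groupSet")
    return [g["GroupId"] for g in attr["Groups"]]


def app_is_healthy():
    """Probe the application's (assumed) health endpoint."""
    try:
        with urllib.request.urlopen(APP_HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False


original_groups = current_security_groups(DB_INSTANCE_ID)
try:
    # Blackhole: swap the DB node's security groups for one that allows nothing.
    ec2.modify_instance_attribute(InstanceId=DB_INSTANCE_ID, Groups=[BLACKHOLE_SG_ID])
    deadline = time.time() + DURATION_SECONDS
    while time.time() < deadline:
        print("app healthy:", app_is_healthy())  # does the app degrade gracefully?
        time.sleep(10)
finally:
    # Always roll back, even if the experiment fails or is interrupted.
    ec2.modify_instance_attribute(InstanceId=DB_INSTANCE_ID, Groups=original_groups)
```

The try/finally rollback is the important design choice: the blast radius stays small only if the fault is guaranteed to be removed when the observation window ends.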
9. The risks of unplanned downtime
● Lost productivity: if staff cannot access systems, they cannot do their job.
● Reputational impact: potential damage to reputation with suppliers, partners, and customers.
● Financial loss: huge financial loss can result from even one hour of downtime, plus possible ransom/forensic costs.
● Data damage: irreplaceable data damage as a result of a malicious attack.
10. Why breaking things should be practiced
"Everything fails all the time" —Werner Vogels
Failure modes to rehearse: compliance + data security incidents, EBS failure, ransomware, AZ failure, human error.
11. Build confidence in your DR plan
o Take stock: inventory your IT assets
o Define critical resources: identify the most critical AWS resources (a tag-based inventory sketch follows below)
o Assess the risk: identify threats and define RTO/RPO
o Document the plan: identify gaps and single points of failure
o TEST: rehearse and evaluate your plan
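A minimal sketch of the "take stock / define critical resources" steps, assuming resources carry a hypothetical criticality tag; it lists the EC2 instances and EBS volumes tagged as critical so they can be mapped to RTO/RPO targets. The tag key and value are assumptions, not an AWS or N2WS convention.

```python
"""
Inventory critical EC2 instances and EBS volumes by tag (tag key is assumed).
"""
import boto3

ec2 = boto3.client("ec2")

# Assumption: teams tag critical resources with criticality=high.
critical_filter = [{"Name": "tag:criticality", "Values": ["high"]}]

# Critical EC2 instances (paginated in case there are many).
for page in ec2.get_paginator("describe_instances").paginate(Filters=critical_filter):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            print("instance:", instance["InstanceId"],
                  instance["Placement"]["AvailabilityZone"])

# Critical EBS volumes.
for page in ec2.get_paginator("describe_volumes").paginate(Filters=critical_filter):
    for volume in page["Volumes"]:
        print("volume:", volume["VolumeId"], f'{volume["Size"]} GiB')
```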
13. Why N2WS?
● Volumes vs. VMs: N2WS backs up at the instance level, including VPC settings, security groups, and instance metadata.
● Restore anything: recover anything from a single file to your entire AWS environment (yes, even encrypted files).
● Multi-tenancy: manage multiple accounts from one console, ideal for service providers or large AWS environments.
14. Your giant recovery button
● Automated policies and schedules: configure what to back up and when; define backup targets, frequency, and retention periods.
● Cross-region & cross-account DR: replicate snapshots to one or more regions and recover quickly in the event of any issue (see the sketch below).
● VPC Capture and Clone tool: configure regular backups of VPC settings and recover to any region.
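To make the cross-region replication idea concrete, here is a minimal plain-boto3 sketch that copies an EBS snapshot into a DR region and waits for it to become usable. N2WS automates this via backup policies; the snapshot ID and region names below are assumptions for illustration only.

```python
"""
Copy an existing EBS snapshot to a DR region and wait for completion.
"""
import boto3

SOURCE_REGION = "us-east-1"              # assumption: where the snapshot lives
DR_REGION = "us-west-2"                  # assumption: disaster recovery region
SNAPSHOT_ID = "snap-0123456789abcdef0"   # assumption: an existing EBS snapshot

# The copy request is issued against the destination (DR) region.
ec2_dr = boto3.client("ec2", region_name=DR_REGION)
copy = ec2_dr.copy_snapshot(
    SourceRegion=SOURCE_REGION,
    SourceSnapshotId=SNAPSHOT_ID,
    Description="DR copy for cross-region recovery drill",
)
print("copy started:", copy["SnapshotId"])

# Wait until the copy is usable in the DR region before relying on it.
ec2_dr.get_waiter("snapshot_completed").wait(SnapshotIds=[copy["SnapshotId"]])
print("snapshot available in", DR_REGION)
```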
15. Design for failure: assume services do fail
o Reserve capacity to absorb AZ service failures: use reserved instances to guarantee capacity
o Eliminate single points of failure: use services that are designed for HA (e.g. a NAT Gateway vs. a NAT instance for internet access)
o Replicate data: replicate across different regions/accounts
o Create redundancy: build services in an active-passive or active-active configuration (a DNS failover sketch follows below)
o Test: always test (and test again)!
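One common way to wire up an active-passive configuration is Route 53 failover routing, sketched below with plain boto3. The hosted zone ID, health check ID, record name, and IP addresses are assumptions; the secondary record is served only when the primary's health check fails.

```python
"""
Minimal active-passive setup using Route 53 failover routing (all IDs assumed).
"""
import boto3

HOSTED_ZONE_ID = "Z0123456789EXAMPLE"                              # assumption
PRIMARY_HEALTH_CHECK_ID = "abcdef01-2345-6789-abcd-ef0123456789"   # assumption
RECORD_NAME = "app.example.com."                                   # assumption
ACTIVE_IP, PASSIVE_IP = "203.0.113.10", "198.51.100.20"            # assumptions

route53 = boto3.client("route53")


def failover_record(role, ip, health_check_id=None):
    """Build one failover record set (role is "PRIMARY" or "SECONDARY")."""
    record = {
        "Name": RECORD_NAME,
        "Type": "A",
        "SetIdentifier": f"app-{role.lower()}",
        "Failover": role,
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record


route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Active-passive failover for the application endpoint",
        "Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record("PRIMARY", ACTIVE_IP,
                                                  PRIMARY_HEALTH_CHECK_ID)},
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record("SECONDARY", PASSIVE_IP)},
        ],
    },
)
```

A chaos experiment against this setup would deliberately fail the primary health check and verify that clients reach the passive endpoint within the expected time.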
Creating resiliency through recovery failure injections
17. Chaos Engineer DR: N2WS Recovery Scenarios
● DRY RUN: configure your recovery scenario prior to restore and be notified of any potential configuration failure (a generic drill-check sketch follows below).
● Execute pre- and post-backup scripts, define the order of recovery targets, and enable a worker configuration test for S3 repositories.
● Automate a pre-defined recovery plan and carry out 'bulk' DR drills, recovering multiple targets with ONE CLICK.
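For readers without the product in hand, here is a generic dry-run-style check written in plain boto3 (this is not the N2WS Recovery Scenarios feature, which is driven from the N2WS console): for every EBS volume attached to instances carrying a hypothetical criticality tag, it verifies that a snapshot newer than an assumed RPO exists. Region, tag key, and RPO are assumptions.

```python
"""
Generic DR drill check: flag critical volumes without a snapshot within the RPO.
"""
from datetime import datetime, timedelta, timezone

import boto3

REGION = "us-east-1"        # assumption
RPO = timedelta(hours=24)   # assumption: 24-hour recovery point objective

ec2 = boto3.client("ec2", region_name=REGION)
cutoff = datetime.now(timezone.utc) - RPO

# Collect EBS volumes attached to critical instances (hypothetical tag key).
volume_ids = []
for page in ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "tag:criticality", "Values": ["high"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            for mapping in instance.get("BlockDeviceMappings", []):
                if "Ebs" in mapping:
                    volume_ids.append(mapping["Ebs"]["VolumeId"])

# Flag any volume whose newest snapshot is older than the RPO (or missing).
# A fuller drill would also confirm the snapshot was copied to the DR region,
# e.g. via tags applied during the snapshot copy.
for vol_id in volume_ids:
    snaps = ec2.describe_snapshots(
        Filters=[{"Name": "volume-id", "Values": [vol_id]}], OwnerIds=["self"]
    )["Snapshots"]
    fresh = any(s["StartTime"] >= cutoff for s in snaps)
    print(f"{vol_id}: {'OK' if fresh else 'NO SNAPSHOT WITHIN RPO'}")
```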
18. Current Disaster Recovery Plan
● Over HALF rely on cross-region DR
● Only 10% use cross-account DR
● Nearly 20% had NO PLAN at all
23. Used by AWS builders, worldwide
● 5K+ AWS accounts
● 13+ petabytes of backup
● HUNDREDS of THOUSANDS of protected instances
● THOUSANDS of end-users & service providers
24. Share your results!
Was it expected?
Did we detect it?
Did our system mitigate it?
What would be the impact?
How will we fix it?
How can we improve next time?
25. Where to get started
● Migrate to the Cloud
● Mitigate Dependency Failure
● Shift to Cloud Native
● Verify Monitoring
● Train Teams
● Test Disaster Recovery
26. Sign up for Gremlin Free
app.gremlin.com/signup
Sign up for N2WS Free Trial
n2ws.com/trial
Q&A