This document discusses using chaos engineering and controlled experiments to proactively test disaster recovery plans. It recommends starting with small, isolated tests and expanding their scope (the "blast radius") over time. The goals are to build confidence in disaster recovery plans, identify weaknesses, and limit the impact of failures. Regular testing is important because systems change frequently. The document provides examples of chaos engineering experiments and emphasizes documenting, rehearsing, and evaluating disaster recovery plans.
Embracing Chaos Engineering to Validate Disaster Recovery Plans
1. Avoiding Disasters by Embracing Chaos: Validating Disaster Recovery with Chaos Engineering
2. Meet our experts
Sebastian Straub, Principal Solutions Architect, N2WS (sebastian@n2ws.com)
Taylor Smith, Product Marketing Manager, Gremlin (taylor.smith@gremlin.com)
3. Black Friday failures, banks breaking, airline incidents
● "Computer Problems Blamed for Flight Delays" (4.1.19)
● "Citibank Website Down, Not Working" (2.28.19)
● "Technical Issues Likely Cost Retailers Billions" (12.01.16)
4. Availability vs. Rate of Change
[Chart: availability (90% up to 99.9999%) plotted against rate of change (1 to 1,000), labeled "The Reliability Gap".]
"Change introduces new forms of failure that are difficult to see before the fact..."
- Richard Cook, How Complex Systems Fail
7. Engineers proactively test to find and fix issues and limit the impact of failures, and that practice makes them more effective when they must work reactively during an incident.
8. How do you use Chaos Engineering for Disaster Recovery?
[Diagram: Region 1 with AZ 1 and AZ 2, Region 2 with AZ 3 and AZ 4.]
1. Start small and expand the blast radius (a minimal blackhole sketch follows after this slide's key questions):
1) Blackhole the connection to a database node, and see how the application reacts.
2) Shut down a container, pod, or node and check how Kubernetes reacts.
3) Blackhole the connection to an entire Availability Zone.
4) Blackhole the connection to all the instances in an entire region.
2. Fail over from active to passive.
Key questions:
● How did our autoscaling, load balancers & gateways react?
● Do we have enough redundancy in place?
● Did our monitoring & alerting trigger at the right time?
● Was our team able to react and recover fast enough?
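As an illustrative sketch of step 1 (blackholing the database node), the Python snippet below swaps an EC2-hosted database node's security groups for a pre-created group that allows no traffic, polls a health endpoint while the fault is active, and then rolls back. The instance ID, security group ID, and health URL are all assumptions; managed tools such as Gremlin run this kind of blackhole attack for you, so treat this only as a sketch of the idea.

```python
"""
Minimal sketch of a "blackhole the database node" experiment, assuming:
  - boto3 credentials with EC2 permissions,
  - DB_INSTANCE_ID points at the database node (hypothetical ID),
  - BLACKHOLE_SG_ID is a pre-created security group with no inbound or
    outbound rules, so attaching it drops all traffic,
  - APP_HEALTH_URL is a hypothetical application health endpoint.
"""
import time
import urllib.request

import boto3

DB_INSTANCE_ID = "i-0123456789abcdef0"    # assumption: the DB node under test
BLACKHOLE_SG_ID = "sg-0deadbeefcafe0123"  # assumption: an "allow nothing" SG
APP_HEALTH_URL = "https://app.example.com/healthz"  # assumption
DURATION_SECONDS = 120

ec2 = boto3.client("ec2")


def current_security_groups(instance_id):
    """Return the security group IDs currently attached to the instance."""
    attr = ec2.describe_instance_attribute(InstanceId=instance_id, Attribute="groupSet")
    return [g["GroupId"] for g in attr["Groups"]]


def app_is_healthy():
    """Probe the application's (assumed) health endpoint."""
    try:
        with urllib.request.urlopen(APP_HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False


original_groups = current_security_groups(DB_INSTANCE_ID)
try:
    # Blackhole: swap the DB node's security groups for one that allows nothing.
    ec2.modify_instance_attribute(InstanceId=DB_INSTANCE_ID, Groups=[BLACKHOLE_SG_ID])
    deadline = time.time() + DURATION_SECONDS
    while time.time() < deadline:
        print("app healthy:", app_is_healthy())  # does the app degrade gracefully?
        time.sleep(10)
finally:
    # Always roll back, even if the experiment fails or is interrupted.
    ec2.modify_instance_attribute(InstanceId=DB_INSTANCE_ID, Groups=original_groups)
```

The try/finally rollback is the important design choice: the blast radius stays small only if the fault is guaranteed to be removed when the observation window ends.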
9. The risks of unplanned downtime
● Lost productivity: if staff cannot access systems, they cannot do their job.
● Reputational impact: potential damage to reputation with suppliers, partners, and customers.
● Financial loss: huge financial loss can result from even one hour of downtime, plus possible ransom/forensic costs.
● Data damage: irreplaceable data damage as a result of a malicious attack.
10. Why breaking things should be practiced
"Everything fails all the time" —Werner Vogels
Failure modes to rehearse: compliance + data security incidents, EBS failure, ransomware, AZ failure, human error.
11. Build confidence in your DR plan
o Take stock: inventory your IT assets
o Define critical resources: identify the most critical AWS resources (a tag-based inventory sketch follows below)
o Assess the risk: identify threats and define RTO/RPO
o Document the plan: identify gaps and single points of failure
o TEST: rehearse and evaluate your plan
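A minimal sketch of the "take stock / define critical resources" steps, assuming resources carry a hypothetical criticality tag; it lists the EC2 instances and EBS volumes tagged as critical so they can be mapped to RTO/RPO targets. The tag key and value are assumptions, not an AWS or N2WS convention.

```python
"""
Inventory critical EC2 instances and EBS volumes by tag (tag key is assumed).
"""
import boto3

ec2 = boto3.client("ec2")

# Assumption: teams tag critical resources with criticality=high.
critical_filter = [{"Name": "tag:criticality", "Values": ["high"]}]

# Critical EC2 instances (paginated in case there are many).
for page in ec2.get_paginator("describe_instances").paginate(Filters=critical_filter):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            print("instance:", instance["InstanceId"],
                  instance["Placement"]["AvailabilityZone"])

# Critical EBS volumes.
for page in ec2.get_paginator("describe_volumes").paginate(Filters=critical_filter):
    for volume in page["Volumes"]:
        print("volume:", volume["VolumeId"], f'{volume["Size"]} GiB')
```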
13. Why N2WS?
● Volumes vs. VMs: N2WS backs up at the instance level, including VPC settings, security groups, and instance metadata.
● Restore anything: recover anything from a single file to your entire AWS environment (yes, even encrypted files).
● Multi-tenancy: manage multiple accounts from one console, ideal for service providers or large AWS environments.
14. Your giant recovery button
● Automated policies and schedules: configure what to back up and when; define backup targets, frequency, and retention periods.
● Cross-region & cross-account DR: replicate snapshots to one or more regions and recover quickly in the event of any issue (see the sketch below).
● VPC Capture and Clone tool: configure regular backups of VPC settings and recover to any region.
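To make the cross-region replication idea concrete, here is a minimal plain-boto3 sketch that copies an EBS snapshot into a DR region and waits for it to become usable. N2WS automates this via backup policies; the snapshot ID and region names below are assumptions for illustration only.

```python
"""
Copy an existing EBS snapshot to a DR region and wait for completion.
"""
import boto3

SOURCE_REGION = "us-east-1"              # assumption: where the snapshot lives
DR_REGION = "us-west-2"                  # assumption: disaster recovery region
SNAPSHOT_ID = "snap-0123456789abcdef0"   # assumption: an existing EBS snapshot

# The copy request is issued against the destination (DR) region.
ec2_dr = boto3.client("ec2", region_name=DR_REGION)
copy = ec2_dr.copy_snapshot(
    SourceRegion=SOURCE_REGION,
    SourceSnapshotId=SNAPSHOT_ID,
    Description="DR copy for cross-region recovery drill",
)
print("copy started:", copy["SnapshotId"])

# Wait until the copy is usable in the DR region before relying on it.
ec2_dr.get_waiter("snapshot_completed").wait(SnapshotIds=[copy["SnapshotId"]])
print("snapshot available in", DR_REGION)
```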
15. Design for failure: assume services do fail
o Reserve capacity to absorb AZ service failures: use reserved instances to guarantee capacity
o Eliminate single points of failure: use services that are designed for HA (e.g. a NAT Gateway vs. a NAT instance for internet access)
o Replicate data: replicate across different regions/accounts
o Create redundancy: build services in an active-passive or active-active configuration (a DNS failover sketch follows below)
o Test: always test (and test again)!
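One common way to wire up an active-passive configuration is Route 53 failover routing, sketched below with plain boto3. The hosted zone ID, health check ID, record name, and IP addresses are assumptions; the secondary record is served only when the primary's health check fails.

```python
"""
Minimal active-passive setup using Route 53 failover routing (all IDs assumed).
"""
import boto3

HOSTED_ZONE_ID = "Z0123456789EXAMPLE"                              # assumption
PRIMARY_HEALTH_CHECK_ID = "abcdef01-2345-6789-abcd-ef0123456789"   # assumption
RECORD_NAME = "app.example.com."                                   # assumption
ACTIVE_IP, PASSIVE_IP = "203.0.113.10", "198.51.100.20"            # assumptions

route53 = boto3.client("route53")


def failover_record(role, ip, health_check_id=None):
    """Build one failover record set (role is "PRIMARY" or "SECONDARY")."""
    record = {
        "Name": RECORD_NAME,
        "Type": "A",
        "SetIdentifier": f"app-{role.lower()}",
        "Failover": role,
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record


route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Active-passive failover for the application endpoint",
        "Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record("PRIMARY", ACTIVE_IP,
                                                  PRIMARY_HEALTH_CHECK_ID)},
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record("SECONDARY", PASSIVE_IP)},
        ],
    },
)
```

A chaos experiment against this setup would deliberately fail the primary health check and verify that clients reach the passive endpoint within the expected time.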
Creating resiliency through recovery failure injections
17. Chaos Engineer DR: N2WS Recovery Scenarios
● DRY RUN: configure your recovery scenario prior to restore and be notified of any potential configuration failure (a generic drill-check sketch follows below).
● Execute pre- and post-backup scripts, define the order of recovery targets, and enable a worker configuration test for S3 repositories.
● Automate a pre-defined recovery plan and carry out 'bulk' DR drills, recovering multiple targets with ONE CLICK.
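For readers without the product in hand, here is a generic dry-run-style check written in plain boto3 (this is not the N2WS Recovery Scenarios feature, which is driven from the N2WS console): for every EBS volume attached to instances carrying a hypothetical criticality tag, it verifies that a snapshot newer than an assumed RPO exists. Region, tag key, and RPO are assumptions.

```python
"""
Generic DR drill check: flag critical volumes without a snapshot within the RPO.
"""
from datetime import datetime, timedelta, timezone

import boto3

REGION = "us-east-1"        # assumption
RPO = timedelta(hours=24)   # assumption: 24-hour recovery point objective

ec2 = boto3.client("ec2", region_name=REGION)
cutoff = datetime.now(timezone.utc) - RPO

# Collect EBS volumes attached to critical instances (hypothetical tag key).
volume_ids = []
for page in ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "tag:criticality", "Values": ["high"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            for mapping in instance.get("BlockDeviceMappings", []):
                if "Ebs" in mapping:
                    volume_ids.append(mapping["Ebs"]["VolumeId"])

# Flag any volume whose newest snapshot is older than the RPO (or missing).
# A fuller drill would also confirm the snapshot was copied to the DR region,
# e.g. via tags applied during the snapshot copy.
for vol_id in volume_ids:
    snaps = ec2.describe_snapshots(
        Filters=[{"Name": "volume-id", "Values": [vol_id]}], OwnerIds=["self"]
    )["Snapshots"]
    fresh = any(s["StartTime"] >= cutoff for s in snaps)
    print(f"{vol_id}: {'OK' if fresh else 'NO SNAPSHOT WITHIN RPO'}")
```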
18. Current Disaster Recovery Plan
● Over HALF rely on cross-region DR
● Only 10% use cross-account DR
● Nearly 20% had NO PLAN at all
23. Used by AWS builders, worldwide
● 5K+ AWS accounts
● 13+ petabytes of backup
● HUNDREDS of THOUSANDS of protected instances
● THOUSANDS of end-users & service providers
24. Share your results!
Was it expected?
Did we detect it?
Did our system mitigate it?
What would be the impact?
How will we fix it?
How can we improve next time?
25. Where to get started
● Migrate to the Cloud
● Mitigate Dependency Failure
● Shift to Cloud Native
● Verify Monitoring
● Train Teams
● Test Disaster Recovery
26. Sign up for Gremlin Free
app.gremlin.com/signup
Sign up for N2WS Free Trial
n2ws.com/trial
Q&A