Avoiding Disasters by Embracing Chaos:
Validating Disaster Recovery with Chaos Engineering
Meet our experts
Sebastian Straub, Principal Solutions Architect, N2WS (sebastian@n2ws.com)
Taylor Smith, Product Marketing Manager, Gremlin (taylor.smith@gremlin.com)
Airline incidents: "Computer Problems Blamed For Flight Delays" (4.1.19)
Banks breaking: "Citibank Website down, not working" (2.28.19)
Black Friday failures: "Technical Issues Likely Cost Retailers Billions" (12.01.16)
[Figure: Availability vs. Rate of Change. x-axis: Rate of Change (1, 10, 100, 1000); y-axis: Availability (90% to 99.9999%); annotation: The Reliability Gap]
“Change introduces new forms of failure that are difficult to see before the fact...”
- Richard Cook, How Complex Systems Fail
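To make the availability levels on the chart concrete, the sketch below converts each one into a yearly downtime budget. It assumes a 365-day year, and the function name `downtime_per_year` is illustrative rather than something from the talk.

```python
from datetime import timedelta

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def downtime_per_year(availability_percent: float) -> timedelta:
    """Return the maximum downtime per year allowed at a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=SECONDS_PER_YEAR * unavailable_fraction)


if __name__ == "__main__":
    # The availability levels shown on the chart's y-axis.
    for level in (90, 99, 99.9, 99.99, 99.999, 99.9999):
        print(f"{level}% -> {downtime_per_year(level)}")
```

Under these assumptions, 99.9% leaves about 8.76 hours of downtime per year and 99.99% leaves about 52.6 minutes.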
Chaos Engineering
Thoughtful, controlled experiments designed to reveal the weaknesses in our systems.
Test: People, Processes, Application, Infrastructure
Progressively test your system to isolate problems and mitigate risk
Engineers proactively test to find and fix issues and limit the impact of failures, and are more effective when they work reactively during an incident
Region 1: AZ 1, AZ 2
Region 2: AZ 3, AZ 4
How do you use Chaos Engineering
for Disaster Recovery?
Start small and expand the Blast Radius
1 Blackhole the connection to a database node and see how the application reacts.
2 Shut down a container, pod, or node to check how Kubernetes reacts.
3 Blackhole the connection to an entire Availability Zone.
4 Blackhole the connection to all the instances in an entire region.
Active Passive
Key questions:
● How did our autoscaling, load balancers & gateways
react?
● Do we have enough redundancy in place?
● Did our monitoring & alerting trigger at the right time?
● Was our team able to react and recover fast enough?
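The "start small and expand" progression can be sketched as a staged runner that stops at the first stage the system does not survive. This is a minimal sketch, not Gremlin's API: the `Stage` fields `inject`, `rollback` and `healthy` are hypothetical hooks you would wire to your own tooling and monitoring.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    inject: Callable[[], None]    # starts the failure (hypothetical hook)
    rollback: Callable[[], None]  # halts the failure (hypothetical hook)
    healthy: Callable[[], bool]   # True if the system coped with the failure


def run_stages(stages: List[Stage]) -> List[str]:
    """Run stages from smallest to largest blast radius.

    Stops at the first stage the system does not survive, so a larger
    failure is never injected before the smaller one has been fixed.
    Returns the names of the stages that passed.
    """
    passed = []
    for stage in stages:
        stage.inject()
        try:
            ok = stage.healthy()
        finally:
            stage.rollback()  # always undo the injected failure
        if not ok:
            break
        passed.append(stage.name)
    return passed


if __name__ == "__main__":
    noop = lambda: None
    plan = [
        Stage("database node", noop, noop, lambda: True),
        Stage("container/pod/node", noop, noop, lambda: True),
        Stage("availability zone", noop, noop, lambda: False),
        Stage("region", noop, noop, lambda: True),
    ]
    print(run_stages(plan))  # ['database node', 'container/pod/node']
```

In the example the Availability Zone stage fails its health check, so the region-wide stage is never run.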
The risks of unplanned downtime
Lost productivity: If staff cannot access systems, they cannot do their job
Reputational impact: Potential damage to reputation with suppliers, partners, customers
Financial loss: Huge financial loss can result from even one hour of downtime + possible ransom/forensic costs
Data damage: Irreplaceable data damage as a result of a malicious attack
Why breaking things should be practiced
Compliance + Data Security
EBS Failure
Ransomware
AZ Failure
Human Error
“Everything fails all the time” —Werner Vogels
Build confidence in your DR plan
o Take stock: inventory your IT assets
o Define critical resources: identify the most critical AWS resources
o Assess the risk: identify threats and define RTO/RPO
o Document the plan: identify gaps and single points of failure
o TEST: rehearse and evaluate your plan
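A drill can be evaluated against the plan's objectives roughly as follows. RTO (recovery time objective) is the longest acceptable time to restore service, and RPO (recovery point objective) is the longest acceptable window of lost data. The function names and the idea of feeding in backup timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Sequence


def achieved_rpo(backup_times: Sequence[datetime], failure_time: datetime) -> timedelta:
    """Data-loss window: time from the last backup before the failure to the failure."""
    earlier = [t for t in backup_times if t <= failure_time]
    if not earlier:
        raise ValueError("no backup exists before the failure")
    return failure_time - max(earlier)


def meets_objectives(
    backup_times: Sequence[datetime],
    failure_time: datetime,
    restored_time: datetime,
    rto: timedelta,
    rpo: timedelta,
) -> bool:
    """True if the drill recovered within the RTO and lost no more data than the RPO."""
    achieved_rto = restored_time - failure_time
    return achieved_rto <= rto and achieved_rpo(backup_times, failure_time) <= rpo


if __name__ == "__main__":
    backups = [datetime(2020, 1, 1, 0, 0), datetime(2020, 1, 1, 6, 0)]
    failure = datetime(2020, 1, 1, 7, 30)
    restored = datetime(2020, 1, 1, 8, 15)
    # 45 minutes to recover, 90 minutes of data lost.
    print(meets_objectives(backups, failure, restored,
                           rto=timedelta(hours=1), rpo=timedelta(hours=2)))
```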
The promise of the cloud…delivered
Why N2WS?
Volumes vs. VMs: N2WS backs up on an instance level, including VPC settings, security groups and instance metadata
Restore Anything: Recover anything from a single file to your entire AWS environment (yes, even encrypted files)
Multi-tenancy: Manage multiple accounts from one console, ideal for service providers or large AWS environments
Your giant recovery button
Automated Policies and Schedules: Configure what to back up and when - define backup targets, frequency and retention periods
Cross-Region & Cross-Account DR: Replicate snapshots to 1+ regions and recover quickly in the event of any issue
VPC Capture and Clone Tool: Configure regular backups of VPC settings and recover to any region
Design for failure: assume services do fail
o Reserve capacity to absorb AZ service failures: use zonal Reserved Instances to reserve capacity
o Eliminate single points of failure: use services that are designed for HA (e.g. a NAT Gateway rather than a NAT instance for internet access)
o Replicate data: replicate across different regions/accounts
o Create redundancy: create services using an active-passive or active-active configuration
o Test: always test (and test again)!
Creating resiliency through Recovery failure injections
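The value of redundancy can be estimated with a simple formula: if each replica is available with probability a and failures are independent, then at least one of n replicas is available with probability 1 - (1 - a)^n. The sketch below assumes independent failures and instant failover, which real active-passive setups rarely achieve, so treat the result as an upper bound.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` redundant copies, each available with
    probability `single`, assuming independent failures and instant failover."""
    if not 0 <= single <= 1:
        raise ValueError("single must be a probability between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1 - (1 - single) ** replicas


if __name__ == "__main__":
    # Two replicas at 99% each: both are down only 0.01 * 0.01 of the time.
    print(f"{combined_availability(0.99, 2):.4%}")
```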
N2WS 2020 Cloud Report Survey
20% NEVER perform
recovery drills!
Chaos Engineer DR: N2WS Recovery Scenarios
DRY RUN: Configure your recovery scenario prior to restore and be notified of any potential configuration failure
Execute pre and post backup scripts, define order of recovery targets, enable a worker configuration test for S3 repositories
Automate a pre-defined recovery plan and carry out ‘bulk’ DR drills recovering multiple targets with ONE CLICK
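The two ideas on this slide, ordering recovery targets and dry-running a scenario before a restore, can be illustrated generically. This is not N2WS's implementation or API: the dependency map, the `backed_up` set and both function names are hypothetical, and the ordering uses Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def recovery_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Order targets so each one is recovered after everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


def dry_run(depends_on: Dict[str, Set[str]], backed_up: Set[str]) -> List[str]:
    """Return configuration problems found without restoring anything."""
    problems = []
    targets = set(depends_on) | {d for deps in depends_on.values() for d in deps}
    for target in sorted(targets - backed_up):
        problems.append(f"no backup found for {target}")
    try:
        recovery_order(depends_on)
    except CycleError:
        problems.append("recovery targets depend on each other in a cycle")
    return problems


if __name__ == "__main__":
    scenario = {"app": {"database"}, "database": {"vpc"}, "vpc": set()}
    print(recovery_order(scenario))              # ['vpc', 'database', 'app']
    print(dry_run(scenario, {"vpc", "database"}))  # ['no backup found for app']
```

An empty list from `dry_run` means no problems were detected by these two checks; it does not prove the restore itself would succeed.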
Current Disaster Recovery Plan
Over HALF rely on cross-region DR
Only 10% use cross-account DR
Nearly 20% had NO PLAN at all
Cross-region data protection
Protect against
regional outages
with cross-region
disaster recovery
Cross-account data protection
Protect against
account
compromises with
cross-account
disaster recovery
The ultimate data protection: Snapshot Vault
Use BOTH cross-region and cross-account DR to create a highly secure “snapshot vault”
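One way to check the "snapshot vault" idea is to confirm that some copy of a snapshot lives in a different account and a different region from the source. The sketch below takes the stricter reading that a single copy must satisfy both conditions. The `SnapshotCopy` record and `is_vaulted` function are hypothetical, and the account IDs are placeholders.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SnapshotCopy:
    account_id: str
    region: str


def is_vaulted(source: SnapshotCopy, copies: Iterable[SnapshotCopy]) -> bool:
    """True if at least one copy sits in a different account AND a different region."""
    return any(
        c.account_id != source.account_id and c.region != source.region
        for c in copies
    )


if __name__ == "__main__":
    source = SnapshotCopy("111111111111", "us-east-1")
    copies = [
        SnapshotCopy("111111111111", "us-west-2"),  # cross-region only
        SnapshotCopy("222222222222", "eu-west-1"),  # cross-region and cross-account
    ]
    print(is_vaulted(source, copies))  # True
```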
Demo Time!
Used by AWS builders, worldwide
5K+ AWS Accounts
13+ Petabytes of Backup
HUNDREDS of THOUSANDS of Protected Instances
THOUSANDS of End-users & Service Providers
Share your results!
Was it expected?
Did we detect it?
Did our system mitigate it?
What would be the impact?
How will we fix it?
How can we improve next time?
Where to get started
Migrate to the Cloud
Mitigate Dependency Failure
Shift to Cloud Native
Verify Monitoring
Train Teams
Test Disaster Recovery
Sign up for Gremlin Free
app.gremlin.com/signup
Sign up for N2WS Free Trial
n2ws.com/trial
Q&A
