Disaster Recovery
What, Why and How
Manish Pandit
Silicon Valley Code Camp, 2018
Why
Define and contextualize Disaster Recovery in a business and technical context
without boiling the ocean.
In other words, this is a very, very high-level overview of a topic where each slide
can easily be a session on its own.
About Me
Manish Pandit
Sr. Director of Engineering at Marqeta
@lobster1234
lobster1234.github.io
Sorry for the....math :(
Availability
A measure of % of time a service is in a usable state.
Also measured in 9s.
Scheduled downtimes do not count towards availability, but may impact customer
satisfaction metrics (more so in a B2C model).
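To make the 9s concrete, here is a minimal sketch that converts an availability percentage into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 365) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Two, three, four and five 9s.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.1f} minutes/year")
```

Each extra 9 cuts the allowed downtime by a factor of ten, which is why every additional 9 tends to be harder to reach than the previous one.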
Uptime
Often interchangeable with Availability
Gotcha: Uptime does not mean much if the server cannot serve requests
Reliability
A measure of the probability of the service being in a usable state for a period of
time.
Mean Time to Failure (MTTF)
Mean Time to Repair (MTTR)
Mean Time between Failures (MTBF)
Mostly used for hardware such as Network/IO controllers, power supplies, etc.
Reliability
“A rack switch goes unresponsive for 28 mins every day”
MTTF = 23 hours 32 minutes
MTTR = 28 minutes
MTBF = 24 hours (MTTF + MTTR)
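The arithmetic for the rack switch can be checked with a few lines of Python. The availability figure at the end uses the common steady-state formula MTTF / (MTTF + MTTR), which is not on the slide and is added here as an assumption about how you might tie reliability back to availability.

```python
from datetime import timedelta

mttr = timedelta(minutes=28)   # switch is unresponsive for 28 minutes
mtbf = timedelta(hours=24)     # one failure every day
mttf = mtbf - mttr             # time the switch works between failures

# Steady-state availability implied by these numbers (assumed formula).
availability = mttf / mtbf

print(mttf)                    # 23:32:00
print(f"{availability:.2%}")   # 98.06%
```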
Disasters
BCP
Business Continuity Plan
“Business continuity planning (or business continuity and resiliency planning) is the
process of creating systems of prevention and recovery to deal with potential
threats to a company.” - Wikipedia
Usually owned and managed by the COO
Disaster Recovery
Disaster Recovery starts where High Availability stops.
Disaster Recovery
Disaster Recovery is a component of BCP, covering the technical/infrastructure
aspects.
Usually owned and managed by the CTO/CIO.
But...how do we put metrics around Disaster Recovery Plan?
RPO
Recovery Point Objective
The maximum amount of data loss that is tolerable without significant impact to
business continuity.
Always defined backwards in time.
Ideal value = 0
RPO
If the RPO is 4 hours, it’d mean you must have (good) backups of data no older
than 4 hours.
Think about your laptop. How far back in time can you go such that losing
everything since then is still tolerable?
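A backup monitor can turn the RPO into a simple check. This is a minimal sketch: the function name and the timestamps are hypothetical, and it assumes you already know when the last good (restorable) backup was taken.

```python
from datetime import datetime, timedelta


def meets_rpo(last_good_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the newest good backup is no older than the RPO."""
    return now - last_good_backup <= rpo


now = datetime(2018, 10, 13, 12, 0)   # hypothetical "time of disaster"
rpo = timedelta(hours=4)

print(meets_rpo(datetime(2018, 10, 13, 9, 0), now, rpo))  # True  (3 hours old)
print(meets_rpo(datetime(2018, 10, 13, 7, 0), now, rpo))  # False (5 hours old)
```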
RTO
Recovery Time Objective
Wider than RPO - Covers more than just data.
The maximum amount of time the system can remain unavailable without
significant impact to the business continuity.
Ideal value = 0
Source: CloudAcademy
RTO and RPO
If it takes 2 hours to restore the last backup that was done 4 hours ago, then RTO
is >= 2 hours, and RPO is >= 4 hours.
If a master fails, and the slave is 10 minutes behind, your RPO cannot be < 10
minutes. If the application needs to be bounced to update the db connections
which takes 10 minutes, then the RTO cannot be < 10 minutes.
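Both examples follow the same rule: the age of the newest recoverable data sets a floor for RPO, and the time spent on recovery steps sets a floor for RTO. A rough sketch, with a hypothetical helper name and the simplifying assumption that recovery steps run one after another:

```python
from datetime import timedelta


def recovery_floor(data_age, recovery_steps):
    """Return (rto_floor, rpo_floor) for one recovery path.

    data_age:       age of the newest recoverable data (backup age or replica lag)
    recovery_steps: durations of the steps needed to serve traffic again
    """
    rto_floor = sum(recovery_steps, timedelta())
    rpo_floor = data_age
    return rto_floor, rpo_floor


# Backup taken 4 hours ago that takes 2 hours to restore.
backup_rto, backup_rpo = recovery_floor(timedelta(hours=4), [timedelta(hours=2)])

# Replica 10 minutes behind, application bounce takes 10 minutes.
replica_rto, replica_rpo = recovery_floor(timedelta(minutes=10), [timedelta(minutes=10)])

print(backup_rto, backup_rpo)     # 2:00:00 4:00:00
print(replica_rto, replica_rpo)   # 0:10:00 0:10:00
```

Real recoveries also include detection and decision time, so measured RTO is usually higher than this floor.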
PTO*
Paid Time Off following the disaster recovery.
*It is more or less a convention to throw PTO in there.
Who decides RTO and RPO?
The business does.
That’s easy - get me zero RTO and RPO
Zero RTO and/or RPO is realistically impossible. (why?)
The business has to establish the tolerable RTO and RPO.
This acts as a requirements-spec for the DR Plan and Implementation.
These limits also help establish the SLA with customers.
Tolerable?
For a bank, an RPO greater than a few minutes = lost transactions.
For an online broker, an RTO greater than a few minutes = lost trades.
For a media company, RTO greater than a few minutes = angry tweets.
For a static website, weekly backups are acceptable with an RPO of 1 week.
For an HR system, RPO greater than a day may be acceptable, but RTO greater
than a few hours may not.
Common Failures
Network backbone/ISP Outage
Software Bugs
Storage Controller/NFS Crashes
Disruptive changes to security settings/firewalls
Corrupt DNS configuration being replicated
AWS/Public Cloud Outage
Hybrid Cloud
Most companies run a hybrid cloud, which means the infrastructure is split (usually
disproportionately) between on-prem and public cloud.
Backup & Restore
Regular backups are copied to the recovery site.
Infrastructure has to be spun up on the recovery site in the event of a disaster.
RPO and RTO can be in hours, if not days.
Inexpensive - Costs a few hundred dollars a month for the storage.
Pilot Light
Data is replicated asynchronously to the failover site
Infrastructure is provisioned, but needs to be started before taking any traffic
(RTO!)
Data replication may be a few seconds/minutes behind (RPO!)
Lower RTO and RPO than Backup & Restore, a bit more $$ for replication.
Warm Standby
Scaled down infrastructure is provisioned, running, ready to take on traffic.
May need to be scaled up to handle full production load (Autoscale!)
Data replication may be a few seconds/minutes behind (RPO!)
Lower RTO than Pilot Light, more $$ (why?)
Multi-Site
Multiple sites taking live production traffic
Difficult to pull off due to database constraints (multi-master, anyone?)
When done right, RPO and RTO of a few seconds to a few minutes
Costs an arm and a leg
Multi Cloud
Mother of them all.
Automation to support multiple cloud providers, plus on-prem.
RPO and RTO similar to multi-site, but provides isolation at a provider level.
Costs an arm, a leg, and a kidney.
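Since cost rises and RTO/RPO fall from Backup & Restore through Multi-Site, picking an approach amounts to finding the cheapest one that still meets the business targets. The sketch below is illustrative only: the class, the function, and every RTO/RPO figure are placeholder assumptions, not measurements. Replace them with numbers from your own DR drills.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto: timedelta   # placeholder value, not a measurement
    rpo: timedelta   # placeholder value, not a measurement


# Ordered cheapest first, following the order of the slides.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    Strategy("Pilot Light", timedelta(hours=1), timedelta(minutes=5)),
    Strategy("Warm Standby", timedelta(minutes=15), timedelta(minutes=5)),
    Strategy("Multi-Site", timedelta(minutes=1), timedelta(minutes=1)),
]


def cheapest_fit(rto_target: timedelta, rpo_target: timedelta) -> Optional[Strategy]:
    """Return the first (cheapest) strategy that meets both targets, if any."""
    for strategy in STRATEGIES:
        if strategy.rto <= rto_target and strategy.rpo <= rpo_target:
            return strategy
    return None


print(cheapest_fit(timedelta(hours=2), timedelta(minutes=10)).name)  # Pilot Light
print(cheapest_fit(timedelta(0), timedelta(0)))                      # None
```

The second call returns None for a zero RTO and RPO: no entry in the table can satisfy it.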
Fail Back
Reverse the data flow
Freeze the DR site
Route traffic to primary site
Unfreeze the DR site
So...
Survey the Land
Start with measuring your current RTO and RPO.
Gather data
You cannot improve what you cannot measure.
Bonus - Detect anomalies across the board.
Runbooks
Write them, and keep them updated.
Review your automation
Automate the infrastructure build-out (Infrastructure as Code, IaC)
Follow the Pull-request model for infrastructure changes.
Automating a destructive script (unintentionally) is the quickest way to a disaster.
foreach ($env == ‘prod’); sudo chmod -R -rx
Practice the DR Plan!
Failure-as-a-service
Inject failures in the infrastructure.
Measure of readiness.
Chaos Engineering.
Netflix - Simian Army
Amazon Aurora Fault Injection Queries
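At the application level, failure injection can be as small as a wrapper that makes a call fail some fraction of the time. This is a toy sketch, not how Simian Army or Aurora fault injection works: all names are hypothetical, and the seeded random generator is only there to make the run repeatable.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real call to simulate an outage."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFailure."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_balance(account_id):
    # Stand-in for a real downstream call.
    return 100


rng = random.Random(42)
flaky_fetch = with_fault_injection(fetch_balance, failure_rate=0.3, rng=rng)

failures = 0
for _ in range(1000):
    try:
        flaky_fetch("acct-1")
    except InjectedFailure:
        failures += 1

print(f"{failures} of 1000 calls failed")
```

Run the callers against the wrapped function and watch whether retries, timeouts, and alerts behave the way the runbook says they should.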
Not all components are equal - neither should their DRs
Blast Radius
A DNS failure can take down an entire data center.
A faulty switch can take down an entire subnet.
A service failure can take down all others dependent on it.
A Region failure has larger blast radius than an Availability Zone failure
A Provider failure has larger blast radius than a Region failure.
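Given a map of what depends on what, the blast radius of a failure is everything that depends on the failed component, directly or indirectly. A minimal sketch using a breadth-first walk; the service names and the dependency map are made up for illustration.

```python
from collections import deque


def blast_radius(depends_on, failed):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the map: for each component, who depends on it?
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    affected.discard(failed)  # only relevant if the graph has a cycle
    return affected


# Hypothetical dependency map: component -> components it needs.
depends_on = {
    "checkout": ["payments", "catalog"],
    "payments": ["dns", "db"],
    "catalog": ["dns"],
    "db": [],
    "dns": [],
}

print(sorted(blast_radius(depends_on, "dns")))  # ['catalog', 'checkout', 'payments']
print(sorted(blast_radius(depends_on, "db")))   # ['checkout', 'payments']
```

Components with the largest result are the ones whose DR deserves the most attention.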
Design for Fault Tolerance and Graceful Degradation
Prefer evented over synchronous processing wherever possible
Always assume failure
In the cloud, there are no edge cases
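Graceful degradation often comes down to having a cheaper answer ready when the preferred one fails. A small sketch of the idea; the function names and the recommendation example are hypothetical.

```python
def with_fallback(primary, fallback):
    """Call primary; if it raises, serve the degraded fallback instead."""
    def wrapper(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Degrade instead of failing the whole request.
            return fallback(*args, **kwargs)
    return wrapper


def personalized_recommendations(user_id):
    # Simulates a dependency that is currently down.
    raise ConnectionError("recommendation service unavailable")


def popular_items(user_id):
    # Static, non-personalized list that needs no downstream call.
    return ["item-1", "item-2", "item-3"]


recommend = with_fallback(personalized_recommendations, popular_items)
print(recommend("user-42"))  # ['item-1', 'item-2', 'item-3']
```

In a real service you would also log and count the fallbacks, so that degraded operation shows up on the dashboards.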
Dashboards - Internal and External
Service health monitoring is critical..
..so is ensuring that the monitors themselves can survive a disaster.
Finally
Make disaster recovery and high availability a topic of discussion during every
stage of a project.
Ask the hard questions.
Embrace failure - learn from it.
We’re hiring!
Thank you!
Manish Pandit
Sr. Director of Engineering at Marqeta
@lobster1234
lobster1234.github.io

 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 

Disaster recovery - What, Why, and How

  • 1. Disaster Recovery What, Why and How Manish Pandit Silicon Valley Code Camp, 2018
  • 6. Why Define and contextualize Disaster Recovery in a business and technical context without boiling the ocean. In other words, this is a very, very high-level overview of a topic where each slide can easily be a session on its own.
  • 7. About Me Manish Pandit Sr. Director of Engineering at Marqeta @lobster1234 lobster1234.github.io
  • 9. Availability A measure of % of time a service is in a usable state. Also measured in 9s. Scheduled downtimes do not count towards availability, but may impact customer satisfaction metrics (more so in a B2C model).
  • 11. Uptime Often interchangeable with Availability Gotcha: Uptime does not mean much if the server cannot serve requests
  • 12. Reliability A measure of the probability of the service being in a usable state for a period of time. Mean Time to Failure (MTTF) Mean Time to Repair (MTTR) Mean Time between Failures (MTBF) Mostly used for hardware such as Network/IO controllers, power supplies, etc.
  • 13. Reliability “A rack switch goes unresponsive for 28 mins every day” MTTF = 23 hours 32 minutes MTTR = 28 minutes MTBF = 24 hours (MTTF + MTTR)
  • 15. BCP Business Continuity Plan “Business continuity planning (or business continuity and resiliency planning) is the process of creating systems of prevention and recovery to deal with potential threats to a company.” - Wikipedia Usually owned and managed by the COO
  • 16. Disaster Recovery Disaster Recovery starts where High Availability stops.
  • 17. Disaster Recovery Disaster Recovery is a component of BCP, covering the technical/infrastructure aspects. Usually owned and managed by the CTO/CIO.
  • 18. But...how do we put metrics around Disaster Recovery Plan?
  • 19. RPO Recovery Point Objective The maximum amount of data loss that is tolerable without significant impact to business continuity. Always defined backwards in time. Ideal value = 0
  • 20. RPO If the RPO is 4 hours, it’d mean you must have (good) backups of data no older than 4 hours. Think about your laptop. How far back in time can you go such that losing everything since then is still tolerable?
  • 21. RTO Recovery Time Objective Wider than RPO - Covers more than just data. The maximum amount of time the system can remain unavailable without significant impact to the business continuity. Ideal value = 0
  • 23. RTO and RPO If it takes 2 hours to restore the last backup that was done 4 hours ago, then RTO is >= 2 hours, and RPO is >= 4 hours. If a master fails, and the slave is 10 minutes behind, your RPO cannot be < 10 minutes. If the application needs to be bounced to update the db connections which takes 10 minutes, then the RTO cannot be < 10 minutes.
  • 24. PTO* Paid Time Off following the disaster recovery. *It is more or less a convention to throw PTO in there.
  • 25. Who decides RTO and RPO? The business does.
  • 26. That’s easy - get me zero RTO and RPO Zero RTO and/or RPO is realistically impossible. (why?) The business has to establish the tolerable RTO and RPO. This acts as a requirements-spec for the DR Plan and Implementation. These limits also help establish the SLA with customers.
  • 27. Tolerable? For a bank, an RPO greater than a few minutes = lost transactions. For an online broker, an RTO greater than a few minutes = lost trades. For a media company, an RTO greater than a few minutes = angry tweets. For a static website, weekly backups are acceptable with an RPO of 1 week. For an HR system, an RPO greater than a day may be acceptable, but an RTO greater than a few hours may not.
  • 28. Common Failures Network backbone/ISP Outage Software Bugs Storage Controller/NFS Crashes Disruptive changes to security settings/firewalls Corrupt DNS configuration being replicated AWS/Public Cloud Outage
  • 29. Hybrid Cloud Most companies run a hybrid cloud, which means the infrastructure is split (usually disproportionately) between on-prem and public cloud.
  • 30. Backup & Restore Regular backups are copied to the recovery site. Infrastructure has to be spun up on the recovery site in the event of a disaster. RPO and RTO can be in hours, if not days. Inexpensive - Costs a few hundred dollars a month for the storage.
  • 31. Pilot Light Data is replicated asynchronously to the failover site Infrastructure is provisioned, but needs to be started before taking any traffic (RTO!) Data replication may be a few seconds/minutes behind (RPO!) Lower RTO and RPO than Backup & Restore, a bit more $$ for replication.
  • 32. Warm Standby Scaled down infrastructure is provisioned, running, ready to take on traffic. May need to be scaled up to handle full production load (Autoscale!) Data replication may be a few seconds/minutes behind (RPO!) Lower RTO than Pilot Light, more $$ (why?)
  • 33. Multi-Site Multiple sites taking live production traffic Difficult to pull off due to database constraints (multi-master, anyone?) When done right, RPO and RTO of a few seconds to a few minutes Costs an arm and a leg
  • 34. Multi Cloud Mother of them all. Automation to support multiple cloud providers, plus on-prem. RPO and RTO similar to multi-site, but provides isolation at a provider level. Costs an arm, a leg, and a kidney.
  • 35. Fail Back Reverse the data flow Freeze the DR site Route traffic to primary site Unfreeze the DR site
  • 36. So...
  • 37. Survey the Land Start with measuring your current RTO and RPO.
  • 38. Gather data You cannot improve what you cannot measure. Bonus - Detect anomalies across the board.
  • 39. Runbooks Write them, and keep them updated.
  • 40. Review your automation Automate the infrastructure build-out with Infrastructure as Code (IaC). Follow the Pull-request model for infrastructure changes. Automating a destructive script (unintentionally) is the quickest way to a disaster. foreach ($env == ‘prod’); sudo chmod -R -rx
  • 43. Failure-as-a-service Inject failures in the infrastructure. Measure of readiness. Chaos Engineering. Netflix - Simian Army Amazon Aurora Fault Injection Queries
  • 44. Not all components are equal - neither should their DRs
  • 45. Blast Radius A DNS failure can take down an entire data center. A faulty switch can take down an entire subnet. A service failure can take down all others dependent on it. A Region failure has a larger blast radius than an Availability Zone failure. A Provider failure has a larger blast radius than a Region failure.
  • 46. Design for Fault Tolerance and Graceful Degradation Prefer evented over synchronous processing wherever possible Always assume failure In the cloud, there are no edge cases
  • 47. Dashboards - Internal and External Service health monitoring is critical.. ..so is ensuring that the monitors themselves can survive a disaster.
  • 48. Finally Make disaster recovery and high availability a topic of discussion during every stage of a project. Ask the hard questions. Embrace failure - learn from it.
  • 51. Thank you! Manish Pandit Sr. Director of Engineering at Marqeta @lobster1234 lobster1234.github.io
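
The arithmetic behind slides 9, 13, and 23 (availability in 9s, MTTF/MTTR/MTBF, and the RTO/RPO lower bounds) can be reproduced in a few lines of Python. This is only a sketch and not part of the original talk. The function names are made up for illustration. It assumes a 365-day year and recovery steps that run one after another. The MTTF / MTBF availability ratio is the usual steady-state definition, not something the slides state.

```python
from datetime import timedelta


def allowed_downtime_per_year(nines: int) -> timedelta:
    """Downtime budget for N nines of availability (3 -> 99.9%).

    Assumption: a 365-day year, with scheduled downtime excluded.
    """
    unavailability = 10 ** -nines  # e.g. 0.001 for three nines
    return timedelta(days=365) * unavailability


def mtbf(mttf: timedelta, mttr: timedelta) -> timedelta:
    """Mean Time Between Failures, as on slide 13: MTTF + MTTR."""
    return mttf + mttr


def availability(mttf: timedelta, mttr: timedelta) -> float:
    """Steady-state availability, taken here as MTTF / MTBF (an assumption)."""
    return mttf / mtbf(mttf, mttr)


def rto_floor(*recovery_steps: timedelta) -> timedelta:
    """Lower bound on RTO, assuming the steps run sequentially."""
    return sum(recovery_steps, timedelta(0))


if __name__ == "__main__":
    # Three nines leaves 8 hours 45 minutes 36 seconds of downtime per year.
    print("99.9% budget:", allowed_downtime_per_year(3))

    # Slide 13: a rack switch is unresponsive for 28 minutes every day.
    mttf = timedelta(hours=23, minutes=32)
    mttr = timedelta(minutes=28)
    print("MTBF:", mtbf(mttf, mttr))
    print(f"Availability: {availability(mttf, mttr):.2%}")

    # Slide 23: the last backup is 4 hours old and takes 2 hours to restore.
    backup_age = timedelta(hours=4)  # RPO cannot be better than this
    restore_time = timedelta(hours=2)
    print("RPO >=", backup_age, "| RTO >=", rto_floor(restore_time))
```

Passing more steps to `rto_floor`, for example a restore followed by a 10-minute application bounce, shows how each manual step raises the achievable RTO.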