Disaster Recovery &
Reliability
Manish Pandit
03/26/2018
Why
Define and contextualize Disaster Recovery in a business and technical context
without boiling the ocean.
In other words, this is a very, very high-level overview of a topic where each slide
can easily be a session on its own.
Sorry for the....math :(
Availability
The percentage of time a service is in a usable state.
Also measured in 9s.
Scheduled downtimes do not count towards availability, but may impact customer
satisfaction metrics (more so in a B2C model).
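To make "measured in 9s" concrete, here is a minimal sketch that converts a number of nines into an availability percentage and a yearly unscheduled-downtime budget. The function names and the 365-day year are assumptions made for illustration, not part of the original slides.

```python
def availability_pct(nines: int) -> float:
    """Availability percentage for a number of nines (2 -> 99.0, 3 -> 99.9)."""
    return 100.0 * (1 - 10 ** -nines)


def allowed_downtime_minutes(nines: int, period_hours: float = 365 * 24) -> float:
    """Unscheduled downtime budget, in minutes, over the given period."""
    return period_hours * 60 * 10 ** -nines


for n in range(1, 6):
    print(f"{n} nines: {availability_pct(n):.3f}% "
          f"-> {allowed_downtime_minutes(n):.1f} minutes of downtime per year")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why every additional nine tends to cost far more than the one before it.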
Reliability
A measure of the probability of the service being in a usable state for a period of
time.
Measured as MTBF (Mean Time Between Failures), and the Failure Rate
Connecting Reliability & Availability
A database goes down for an hour of unscheduled maintenance in a 24-hour day
Availability = 23/24 ≈ 95.8% (or 1 Nine)
Reliability = 23 hours
MTBF = 23 hours; as I can rely on that db for only 23 hours.
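The arithmetic behind this example can be sketched as follows. It takes the one-hour outage over a 24-hour observation window (the window is an assumption, the slide does not state it), and the helper names are illustrative only.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the observation window in which the service was usable."""
    return uptime_hours / (uptime_hours + downtime_hours)


def mtbf(uptime_hours: float, failures: int) -> float:
    """Mean Time Between Failures: operating time divided by failure count."""
    return uptime_hours / failures


def failure_rate(mtbf_hours: float) -> float:
    """Failures per hour of operation, the reciprocal of MTBF."""
    return 1 / mtbf_hours


uptime, downtime, failures = 23.0, 1.0, 1   # one 1-hour outage in a 24-hour day
print(f"Availability: {availability(uptime, downtime):.1%}")
print(f"MTBF: {mtbf(uptime, failures):.0f} hours")
print(f"Failure rate: {failure_rate(mtbf(uptime, failures)):.4f} failures/hour")
```

Availability says how much of the window was usable; MTBF says how long you can expect to run before the next failure. The same availability can come from one long outage or many short ones, and MTBF is what tells the two apart.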
Disasters
BCP
Business Continuity Plan
“Business continuity planning (or business continuity and resiliency planning) is the
process of creating systems of prevention and recovery to deal with potential
threats to a company.” - Wikipedia
Usually owned and managed by the COO
Disaster Recovery
Disaster Recovery starts where High Availability stops.
Disaster Recovery
Disaster Recovery is a component of BCP, covering the technical/infrastructure
area.
Usually owned and managed by the CTO/CIO.
But...how do we put metrics around Disaster Recovery Plan?
RPO
Recovery Point Objective
The maximum amount of data loss that is tolerable without significant impact to
business continuity.
Always defined backwards in time.
Ideal value = 0
RPO
If the RPO is 4 hours, it’d mean you must have (good) backups of data no older
than 4 hours.
Think about your laptop. How far back in time can you go and still tolerate losing
everything created since then?
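A minimal sketch of that rule as a check you could run against backup metadata. The function name and the timestamps are hypothetical; a real check would also need to verify that the backup is actually restorable ("good").

```python
from datetime import datetime, timedelta


def meets_rpo(last_good_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the newest good backup is no older than the RPO."""
    return now - last_good_backup <= rpo


now = datetime(2018, 3, 26, 12, 0)
rpo = timedelta(hours=4)

print(meets_rpo(datetime(2018, 3, 26, 9, 0), now, rpo))   # backup is 3 hours old
print(meets_rpo(datetime(2018, 3, 26, 7, 0), now, rpo))   # backup is 5 hours old
```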
RTO
Recovery Time Objective
Wider than RPO - Covers more than just data.
The maximum amount of time the system can remain unavailable without
significant impact to the business continuity.
Ideal value = 0
Source: CloudAcademy
RTO and RPO
If it takes 2 hours to restore the last backup that was done 4 hours ago, then RTO
is >= 2 hours, and RPO is >= 4 hours.
If a master fails, and the slave is 10 minutes behind, your RPO cannot be < 10
minutes. If the application needs to be bounced to update the db connections
which takes 10 minutes, then the RTO cannot be < 10 minutes.
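Both examples follow the same pattern: the staleness of the newest recoverable data sets a floor on RPO, and the sum of the recovery steps sets a floor on RTO. A small sketch, with a hypothetical helper name:

```python
from datetime import timedelta
from typing import Iterable, Tuple


def recovery_floors(data_staleness: timedelta,
                    recovery_steps: Iterable[timedelta]) -> Tuple[timedelta, timedelta]:
    """Return (RPO floor, RTO floor) for a recovery scenario.

    The RPO cannot beat the age of the newest recoverable data, and the RTO
    cannot beat the total time of the steps needed to get back in service.
    """
    rpo_floor = data_staleness
    rto_floor = sum(recovery_steps, timedelta(0))
    return rpo_floor, rto_floor


# Backup taken 4 hours ago that takes 2 hours to restore.
print(recovery_floors(timedelta(hours=4), [timedelta(hours=2)]))

# Replica 10 minutes behind, plus a 10-minute application bounce.
print(recovery_floors(timedelta(minutes=10), [timedelta(minutes=10)]))
```

These are lower bounds only. Detection time, decision time, and DNS propagation would all add further steps in a real incident.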
PTO
Paid Time Off following the disaster recovery.
*It is more or less a convention to throw PTO in there.
Who decides RTO and RPO?
The business does.
That’s easy - get me zero RTO and RPO
Zero RTO and/or RPO is realistically impossible. (why?)
The business has to establish the tolerable RTO and RPO.
This acts as a requirements-spec for the DR Plan and Implementation.
These limits also help establish the SLA with customers.
Tolerable?
For a bank, an RPO greater than a few minutes = lost transactions.
For an online broker, an RTO greater than a few minutes = lost trades.
For a media company, RTO greater than a few minutes = angry tweets.
For a static website, weekly backups are acceptable with an RPO of 1 week.
For an HR system, RPO greater than a day may be acceptable, but RTO greater
than a few hours may not.
Hybrid Cloud
Most companies run a hybrid cloud, which means the infrastructure is split (usually
disproportionately) between on-prem and public cloud.
Common Failures
Network backbone/ISP Outage
Software Bugs
Storage Controller/NFS Crashes
Disruptive changes to security settings/firewalls
Corrupt DNS configuration being replicated
AWS/Public Cloud Outage
Backup & Restore
Regular backups are copied to the recovery site.
Infrastructure has to be spun up on the recovery site in the event of a disaster.
RPO and RTO can be in hours, if not days.
Inexpensive - Costs a few hundred dollars a month for the storage.
Pilot Light
Infrastructure is provisioned, but needs to be started before taking any traffic
(RTO!)
Data replication may be a few seconds/minutes behind (RPO!)
Lower RTO and RPO than Backup & Restore, a bit more $$ for replication.
Warm Standby
Infrastructure is provisioned, ready to take on traffic.
It may need to be scaled up to handle full production load.
Data replication may be a few seconds/minutes behind (RPO!)
Lower RTO than Pilot Light, more $$ (why?)
Multi-Site
Multiple sites taking live production traffic
Difficult to pull off due to database constraints (multi-master, anyone?)
When done right, RPO and RTO of a few seconds to a few minutes
Costs an arm and a leg
Multi Cloud
Mother of them all.
Automation to support multiple cloud providers, plus on-prem.
RPO and RTO similar to multi-site, but provides isolation at a provider level.
Costs an arm, a leg, and a kidney.
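The trade-off across these strategies can be sketched as "pick the cheapest option that still meets the business's RTO and RPO". The numbers below are illustrative placeholders chosen only to preserve the ordering described in the slides (they are assumptions, not measurements), and `cost_rank` is a relative rank, not a dollar figure.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto: timedelta      # placeholder achievable RTO
    rpo: timedelta      # placeholder achievable RPO
    cost_rank: int      # 1 = cheapest, 5 = most expensive


# Placeholder figures: assumptions for illustration only.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=24), timedelta(hours=24), 1),
    Strategy("Pilot Light", timedelta(hours=1), timedelta(minutes=5), 2),
    Strategy("Warm Standby", timedelta(minutes=15), timedelta(minutes=5), 3),
    Strategy("Multi-Site", timedelta(minutes=1), timedelta(minutes=1), 4),
    Strategy("Multi Cloud", timedelta(minutes=1), timedelta(minutes=1), 5),
]


def cheapest_strategy(rto_target: timedelta, rpo_target: timedelta) -> Optional[Strategy]:
    """Cheapest strategy whose RTO and RPO both meet the targets, else None."""
    candidates = [s for s in STRATEGIES
                  if s.rto <= rto_target and s.rpo <= rpo_target]
    return min(candidates, key=lambda s: s.cost_rank, default=None)


print(cheapest_strategy(timedelta(days=7), timedelta(days=7)))        # static website
print(cheapest_strategy(timedelta(minutes=30), timedelta(minutes=10)))
print(cheapest_strategy(timedelta(seconds=1), timedelta(seconds=1)))  # nothing qualifies
```

The last call returns `None`, which mirrors the earlier point that asking for zero RTO and RPO leaves no realistic option on the table.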
So...
Survey the Land
Start with measuring your current RTO and RPO.
Gather data
You cannot improve what you cannot measure.
Bonus - Detect anomalies across the board.
Runbooks
Write them, and keep them updated.
Review your automation
Follow the Pull-request model for infrastructure changes.
Automating a destructive script (unintentionally) is the quickest way to a disaster.
foreach ($env == ‘prod’); sudo chmod -R -rx
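One common way to reduce that risk is to make destructive automation dry-run by default and to require an explicit confirmation for production. The sketch below is one possible shape for such a guard; every name in it is hypothetical, and the "destructive" action is a harmless stand-in.

```python
from typing import Callable


class ProductionGuardError(RuntimeError):
    """Raised when a destructive action targets prod without confirmation."""


def run_destructive(action: Callable[[str], str], env: str, *,
                    dry_run: bool = True, confirm: str = "") -> str:
    """Run a destructive action only when the caller has clearly opted in."""
    if dry_run:
        # Default path: report what would happen, change nothing.
        return f"[dry-run] would run {action.__name__} against {env}"
    if env == "prod" and confirm != env:
        raise ProductionGuardError("refusing to touch prod without confirm='prod'")
    return action(env)


def wipe_temp_files(env: str) -> str:
    # Stand-in for a real destructive operation; it only returns a message.
    return f"wiped temp files in {env}"


print(run_destructive(wipe_temp_files, "prod"))
print(run_destructive(wipe_temp_files, "staging", dry_run=False))
```

A guard like this does not replace review. It only ensures that a script merged by mistake fails safe instead of running against production.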
Practice the DR Plan!
Failure-as-a-service
Inject failures in the infrastructure.
Measure of readiness.
Chaos Engineering.
Netflix - Simian Army
Amazon Aurora Failure Injections
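At its smallest scale, failure injection is just a wrapper that makes a call fail some fraction of the time so you can watch how callers cope. The sketch below is a toy illustration of that idea only; it is not how Simian Army or Aurora fault injection work, and all names are hypothetical.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFailure(Exception):
    """Marks a failure that was injected on purpose."""


def with_failure_injection(func: Callable[..., T], failure_rate: float,
                           rng: random.Random) -> Callable[..., T]:
    """Wrap func so that roughly failure_rate of calls raise InjectedFailure."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def lookup_user(user_id: int) -> str:
    return f"user-{user_id}"


flaky_lookup = with_failure_injection(lookup_user, failure_rate=0.3,
                                      rng=random.Random(42))

failures = 0
for i in range(1000):
    try:
        flaky_lookup(i)
    except InjectedFailure:
        failures += 1
print(f"{failures} of 1000 calls failed by design")
```

Passing in a seeded `random.Random` keeps an experiment repeatable, which helps when comparing readiness before and after a fix.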
Not all components are equal - neither should their DRs
Blast Radius
A DNS failure can take down an entire data center.
A faulty switch can take down an entire subnet.
A service failure can take down all others dependent on it.
A Region failure has a larger blast radius than an Availability Zone failure.
A Provider failure has larger blast radius than a Region failure.
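If service dependencies are written down, the blast radius of a component can be estimated by walking the dependency graph backwards from the failed node. A minimal sketch; the service names and the graph itself are made up for illustration.

```python
from collections import deque
from typing import Dict, Set


def blast_radius(depends_on: Dict[str, Set[str]], failed: str) -> Set[str]:
    """Everything that directly or transitively depends on the failed component."""
    # Invert "X depends on Y" into "Y is depended on by X".
    dependents: Dict[str, Set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != failed and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


# Hypothetical dependency map: each key depends on the services in its set.
DEPENDS_ON = {
    "web": {"api"},
    "api": {"auth", "db"},
    "auth": {"dns"},
    "reports": {"db"},
}

for component in ("dns", "db", "web"):
    print(component, "->", sorted(blast_radius(DEPENDS_ON, component)))
```

Components with the largest result sets are the ones that most deserve a stricter DR plan, which is the point of "not all components are equal".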
Design for Fault Tolerance and Graceful Degradation
Use evented processing vs. synchronous wherever possible
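Graceful degradation often comes down to serving a stale or reduced answer when a dependency fails instead of failing the whole request. A small sketch of that pattern, with hypothetical names and an in-memory dict standing in for a real cache:

```python
from typing import Callable, Dict, List, Tuple


def get_recommendations(fetch: Callable[[str], List[str]],
                        cache: Dict[str, List[str]],
                        user_id: str) -> Tuple[List[str], bool]:
    """Return (recommendations, degraded).

    On success the cache is refreshed. On failure the last known result is
    served (or an empty list), and degraded is True so callers can tell.
    """
    try:
        result = fetch(user_id)
    except Exception:
        return cache.get(user_id, []), True
    cache[user_id] = result
    return result, False


def healthy_fetch(user_id: str) -> List[str]:
    return ["item-1", "item-2"]


def broken_fetch(user_id: str) -> List[str]:
    raise ConnectionError("recommendation service is down")


cache: Dict[str, List[str]] = {}
print(get_recommendations(healthy_fetch, cache, "alice"))  # fresh result
print(get_recommendations(broken_fetch, cache, "alice"))   # stale but usable
print(get_recommendations(broken_fetch, cache, "bob"))     # empty, still no crash
```

Returning the `degraded` flag alongside the data lets dashboards count how often the fallback path is taken.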
Dashboards - Internal and External
Service health monitoring is critical..
..so is ensuring that the monitors themselves can survive a disaster.
Finally
Make disaster recovery and high availability a topic of discussion during every
stage of a project.
Ask the hard questions.
Embrace failure - learn from it.

 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
Buy Epson EcoTank L3210 Colour Printer Online.pdf
Buy Epson EcoTank L3210 Colour Printer Online.pdfBuy Epson EcoTank L3210 Colour Printer Online.pdf
Buy Epson EcoTank L3210 Colour Printer Online.pdf
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 

Disaster Recovery and Reliability

  • 6. Why Define and contextualize Disaster Recovery in a business and technical context without boiling the ocean. In other words, this is a very, very high-level overview of a topic where each slide can easily be a session on its own.
  • 8. Availability A measure of % of time a service is in a usable state. Also measured in 9s. Scheduled downtimes do not count towards availability, but may impact customer satisfaction metrics (more so in a B2C model).
  • 10. Reliability A measure of the probability of the service being in a usable state for a period of time. Measured as MTBF (Mean Time Between Failures), and the Failure Rate
  • 11. Connecting Reliability & Availability A database goes down for an unscheduled maintenance for an hour. Over a 24-hour window, Availability = 23/24 ≈ 95.8% (or 1 Nine). Reliability = 23 hours. MTBF = 23 hours, as I can rely on that db for only 23 hours.
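As a rough worked example of the database outage above, the sketch below assumes the one-hour outage is observed over a 24-hour window (the window is an assumption; the slide only implies it through the 23 hours). The function names are illustrative, not from the talk.

```python
import math


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the observation window in which the service was usable."""
    return uptime_hours / (uptime_hours + downtime_hours)


def nines(avail: float) -> int:
    """Number of leading 9s in an availability figure (0.9 -> 1, 0.99 -> 2)."""
    if not 0.0 <= avail < 1.0:
        raise ValueError("availability must be in [0, 1)")
    # The small epsilon guards against floating-point error at exact boundaries.
    return max(0, math.floor(-math.log10(1.0 - avail) + 1e-9))


def mtbf(uptime_hours: float, failures: int) -> float:
    """Mean Time Between Failures: operating time divided by failure count."""
    return uptime_hours / failures


def failure_rate(uptime_hours: float, failures: int) -> float:
    """Failures per operating hour, the reciprocal of MTBF."""
    return failures / uptime_hours


# One unscheduled hour of downtime in an assumed 24-hour window.
a = availability(uptime_hours=23, downtime_hours=1)
print(f"availability = {a:.1%}, nines = {nines(a)}, MTBF = {mtbf(23, 1)} h")
```

With 23 hours up and 1 hour down, availability comes to about 95.8%, which is still only one nine, and the MTBF is 23 hours.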
  • 13. BCP Business Continuity Plan “Business continuity planning (or business continuity and resiliency planning) is the process of creating systems of prevention and recovery to deal with potential threats to a company.” - Wikipedia Usually owned and managed by the COO
  • 14. Disaster Recovery Disaster Recovery starts where High Availability stops.
  • 15. Disaster Recovery Disaster Recovery is a component of BCP, covering the technical/infrastructure area. Usually owned and managed by the CTO/CIO.
  • 16. But...how do we put metrics around Disaster Recovery Plan?
  • 17. RPO Recovery Point Objective The maximum amount of data loss that is tolerable without significant impact to business continuity. Always defined backwards in time. Ideal value = 0
  • 18. RPO If the RPO is 4 hours, it’d mean you must have (good) backups of data no older than 4 hours. Think about your laptop. How far back in time can you go and still tolerate losing everything since that point?
  • 19. RTO Recovery Time Objective Wider than RPO - Covers more than just data. The maximum amount of time the system can remain unavailable without significant impact to the business continuity. Ideal value = 0
  • 21. RTO and RPO If it takes 2 hours to restore the last backup that was done 4 hours ago, then RTO is >= 2 hours, and RPO is >= 4 hours. If a master fails, and the slave is 10 minutes behind, your RPO cannot be < 10 minutes. If the application needs to be bounced to update the db connections which takes 10 minutes, then the RTO cannot be < 10 minutes.
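The bounds on this slide can be sketched as simple arithmetic. This is a minimal illustration, assuming recovery steps run one after another and that a replica, when present, is usable as the recovery source; the function names are hypothetical.

```python
from datetime import timedelta
from typing import Optional


def achievable_rpo(last_backup_age: timedelta,
                   replication_lag: Optional[timedelta] = None) -> timedelta:
    """Lower bound on data loss: the age of the freshest usable copy."""
    candidates = [last_backup_age]
    if replication_lag is not None:
        candidates.append(replication_lag)
    return min(candidates)


def achievable_rto(*recovery_steps: timedelta) -> timedelta:
    """Lower bound on downtime: total of the sequential recovery steps."""
    return sum(recovery_steps, timedelta())


def meets_objective(achievable: timedelta, objective: timedelta) -> bool:
    """An objective is only realistic if the achievable bound fits inside it."""
    return achievable <= objective


# Backup taken 4 hours ago that takes 2 hours to restore.
print(achievable_rpo(timedelta(hours=4)), achievable_rto(timedelta(hours=2)))

# Replica 10 minutes behind, plus a 10-minute application bounce.
print(achievable_rpo(timedelta(hours=4), timedelta(minutes=10)),
      achievable_rto(timedelta(minutes=10)))
```

Comparing these bounds against the targets the business sets shows quickly whether an objective is even reachable with the current setup.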
  • 22. PTO Paid Time Off following the disaster recovery. *It is more or less a convention to throw PTO in there.
  • 23. Who decides RTO and RPO? The business does.
  • 24. That’s easy - get me zero RTO and RPO Zero RTO and/or RPO is realistically impossible. (why?) The business has to establish the tolerable RTO and RPO. This acts as a requirements-spec for the DR Plan and Implementation. These limits also help establish the SLA with customers.
  • 25. Tolerable? For a bank, an RPO greater than a few minutes = lost transactions. For an online broker, an RTO greater than a few minutes = lost trades. For a media company, an RTO greater than a few minutes = angry tweets. For a static website, weekly backups are acceptable with an RPO of 1 week. For an HR system, an RPO greater than a day may be acceptable, but an RTO greater than a few hours may not.
  • 26. Hybrid Cloud Most companies run a hybrid cloud, which means the infrastructure is split (usually disproportionately) between on-prem and public cloud.
  • 27. Common Failures Network backbone/ISP outage. Software bugs. Storage controller/NFS crashes. Disruptive changes to security settings/firewalls. Corrupt DNS configuration being replicated. AWS/public cloud outage.
  • 29. Backup & Restore Regular backups are copied to the recovery site. Infrastructure has to be spun up on the recovery site in the event of a disaster. RPO and RTO can be in hours, if not days. Inexpensive - Costs a few hundred dollars a month for the storage.
  • 30. Pilot Light Infrastructure is provisioned, but needs to be started before taking any traffic (RTO!). Data replication may be a few seconds/minutes behind (RPO!). Lower RTO and RPO than Backup & Restore, a bit more $$ for replication.
  • 31. Warm Standby Infrastructure is provisioned, ready to take on traffic. It may need to be scaled up to handle full production load. Data replication may be a few seconds/minutes behind (RPO!). Lower RTO than Pilot Light, more $$ (why?)
  • 32. Multi-Site Multiple sites taking live production traffic. Difficult to pull off due to database constraints (multi-master, anyone?). When done right, RPO and RTO of a few seconds to a few minutes. Costs an arm and a leg.
  • 33. Multi Cloud Mother of them all. Automation to support multiple cloud providers, plus on-prem. RPO and RTO similar to multi-site, but provides isolation at a provider level. Costs an arm, a leg, and a kidney.
  • 34. So...
  • 35. Survey the Land Start with measuring your current RTO and RPO.
  • 36. Gather data You cannot improve what you cannot measure. Bonus - Detect anomalies across the board.
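One way to start measuring is to derive the observed recovery time and data-loss window from past incident records. The sketch below assumes each incident is logged with three timestamps; the `Incident` fields and function names are hypothetical, not taken from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Tuple


@dataclass
class Incident:
    failed_at: datetime          # when the service became unusable
    restored_at: datetime        # when the service was usable again
    last_good_data_at: datetime  # newest data that survived the incident


def observed_recovery_time(incident: Incident) -> timedelta:
    """How long the service was down (compare against the RTO)."""
    return incident.restored_at - incident.failed_at


def observed_data_loss(incident: Incident) -> timedelta:
    """How much recent data was lost (compare against the RPO)."""
    return incident.failed_at - incident.last_good_data_at


def worst_case(incidents: Iterable[Incident]) -> Tuple[timedelta, timedelta]:
    """Worst observed (recovery time, data loss) across all incidents."""
    incidents = list(incidents)
    if not incidents:
        raise ValueError("no incidents recorded")
    return (max(observed_recovery_time(i) for i in incidents),
            max(observed_data_loss(i) for i in incidents))
```

The worst observed values are a reasonable first estimate of the RTO and RPO the current setup can actually deliver.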
  • 37. Runbooks Write them, and keep them updated.
  • 38. Review your automation Follow the Pull-request model for infrastructure changes. Automating a destructive script (unintentionally) is the quickest way to a disaster. foreach ($env == ‘prod’); sudo chmod -R -rx
  • 40. Failure-as-a-service Inject failures in the infrastructure. Measure of readiness. Chaos Engineering. Netflix - Simian Army. Amazon Aurora Failure Injections.
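At the application level, failure injection can be sketched as a wrapper that makes a call fail some fraction of the time. This is a toy illustration of the idea only; it is not how Simian Army or Aurora fault injection work, and `with_failure_injection` and `InjectedFailure` are hypothetical names.

```python
import random
from typing import Callable, Optional


class InjectedFailure(RuntimeError):
    """Raised in place of the real call when a failure is injected."""


def with_failure_injection(func: Callable, failure_rate: float,
                           rng: Optional[random.Random] = None) -> Callable:
    """Wrap func so that roughly failure_rate of calls raise InjectedFailure."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so 0.0 never fails and 1.0 always does.
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id: int) -> dict:
    """Stand-in for a call to a real dependency."""
    return {"id": user_id}


# Fail about 1 in 10 calls to see how callers cope.
flaky_fetch = with_failure_injection(fetch_profile, failure_rate=0.1)
```

Running callers against the wrapped function shows whether retries, fallbacks and alerts behave as expected before a real outage does.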
  • 41. Not all components are equal - neither should their DRs
  • 42. Blast Radius A DNS failure can take down an entire data center. A faulty switch can take down an entire subnet. A service failure can take down all others dependent on it. A Region failure has a larger blast radius than an Availability Zone failure. A Provider failure has a larger blast radius than a Region failure.
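Blast radius can be estimated from a dependency map by collecting everything that depends, directly or transitively, on the failed component. The sketch below assumes such a map exists; the component names in the example graph are made up.

```python
from collections import deque
from typing import Dict, Iterable, Set


def blast_radius(dependents: Dict[str, Iterable[str]], failed: str) -> Set[str]:
    """Return every component that transitively depends on `failed`.

    `dependents` maps a component to the components that directly depend on it.
    """
    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        component = queue.popleft()
        for dependent in dependents.get(component, ()):
            if dependent != failed and dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical dependency map: DNS is used by the API and the database,
# the API also uses the database, and the web tier uses the API.
graph = {"dns": ["api", "db"], "db": ["api"], "api": ["web"]}
print(sorted(blast_radius(graph, "dns")))  # ['api', 'db', 'web']
```

Components with the largest affected set are the ones whose DR plans deserve the most attention.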
  • 43. Design for Fault Tolerance and Graceful Degradation Use evented processing vs. synchronous wherever possible
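A minimal sketch of the evented idea: the producer records an event and moves on, and delivery to the downstream service is retried later if that service is down. The in-memory `EventBuffer` here is a hypothetical stand-in for a durable message broker, which is what a real system would need.

```python
from collections import deque
from typing import Any, Callable


class EventBuffer:
    """Holds events until a downstream handler accepts them."""

    def __init__(self) -> None:
        self._pending = deque()

    def publish(self, event: Any) -> None:
        """Accept the event immediately; the producer never waits on downstream."""
        self._pending.append(event)

    def drain(self, handler: Callable[[Any], None]) -> int:
        """Deliver pending events in order and return how many were delivered.

        Stops at the first failure and keeps the undelivered events queued,
        so a downstream outage delays work instead of losing it.
        """
        delivered = 0
        while self._pending:
            event = self._pending[0]
            try:
                handler(event)
            except Exception:
                break
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self) -> int:
        return len(self._pending)
```

With a synchronous call the producer would fail whenever the downstream service does; here it degrades to "accepted, processed later".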
  • 44. Dashboards - Internal and External Service health monitoring is critical... so is ensuring that the monitors themselves can survive a disaster.
  • 45. Finally Make disaster recovery and high availability a topic of discussion during every stage of a project. Ask the hard questions. Embrace failure - learn from it.