Disaster recovery - What, Why, and How

Disaster Recovery
What, Why and How
Manish Pandit
Silicon Valley Code Camp, 2018

Why
Define and contextualize Disaster Recovery in a business and technical context
without boiling the ocean.
In other words, this is a very high-level overview of a topic where each slide
could easily be a session of its own.

About Me
Manish Pandit
Sr. Director of Engineering at Marqeta
@lobster1234
lobster1234.github.io

Availability
The percentage of time a service is in a usable state.
Also measured in 9s.
Scheduled downtimes do not count towards availability, but may impact customer
satisfaction metrics (more so in a B2C model).
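To make the "9s" concrete, here is a minimal sketch that converts a number of nines into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration; they are not from the talk.

```python
def allowed_downtime_minutes(nines: int, period_days: float = 365.0) -> float:
    """Downtime budget in minutes for an availability of `nines` nines.

    Assumption: the period is a plain 365-day year unless stated otherwise.
    """
    if nines < 1:
        raise ValueError("nines must be >= 1")
    unavailability = 10.0 ** (-nines)  # e.g. 3 nines = 99.9% available = 0.001 unavailable
    return period_days * 24 * 60 * unavailability


for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten, which is why every additional nine tends to be much more expensive than the previous one.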

Uptime
Often interchangeable with Availability
Gotcha: Uptime does not mean much if the server cannot serve requests

Reliability
A measure of the probability of the service being in a usable state for a period of
time.
Mean Time to Failure (MTTF)
Mean Time to Repair (MTTR)
Mean Time between Failures (MTBF)
Mostly used for hardware such as Network/IO controllers, power supplies, etc.

Reliability
“A rack switch goes unresponsive for 28 mins every day”
MTTF = 23 hours 32 minutes
MTTR = 28 minutes
MTBF = 24 hours (MTTF + MTTR)
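The arithmetic on this slide can be checked with a short sketch. The availability formula MTTF / (MTTF + MTTR) is the usual steady-state definition and is an addition here; the slide itself only states the three mean times.

```python
from datetime import timedelta


def mtbf(mttf: timedelta, mttr: timedelta) -> timedelta:
    """Mean time between failures, using the slide's convention MTBF = MTTF + MTTR."""
    return mttf + mttr


def availability(mttf: timedelta, mttr: timedelta) -> float:
    """Steady-state availability: the fraction of each failure cycle spent working."""
    return mttf / (mttf + mttr)


# The rack switch from the slide: unresponsive for 28 minutes every day.
switch_mttf = timedelta(hours=23, minutes=32)
switch_mttr = timedelta(minutes=28)

print("MTBF:", mtbf(switch_mttf, switch_mttr))
print(f"Availability: {availability(switch_mttf, switch_mttr):.2%}")
```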

BCP
Business Continuity Plan
“Business continuity planning (or business continuity and resiliency planning) is the
process of creating systems of prevention and recovery to deal with potential
threats to a company.” - Wikipedia
Usually owned and managed by the COO

Disaster Recovery
Disaster Recovery starts where High Availability stops.

Disaster Recovery
Disaster Recovery is a component of BCP, covering the technical/infrastructure
aspects.
Usually owned and managed by the CTO/CIO.

But... how do we put metrics around a Disaster Recovery Plan?

RPO
Recovery Point Objective
The maximum amount of data loss that is tolerable without significant impact to
business continuity.
Always defined backwards in time.
Ideal value = 0

RPO
If the RPO is 4 hours, it’d mean you must have (good) backups of data no older
than 4 hours.
Think about your laptop. How far back in time can you go such that losing
everything newer than that point is still tolerable?
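A backup-freshness check against the RPO might look like the sketch below. The function name `meets_rpo` and the timestamps are hypothetical; in practice the "last good backup" time would come from whatever verifies your backups.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def meets_rpo(last_good_backup: datetime, rpo: timedelta,
              now: Optional[datetime] = None) -> bool:
    """True if the newest verified backup is no older than the RPO."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_good_backup <= rpo


# Arbitrary fixed clock so the example is repeatable.
now = datetime(2018, 10, 13, 12, 0, tzinfo=timezone.utc)
rpo = timedelta(hours=4)

print(meets_rpo(now - timedelta(hours=3), rpo, now))  # backup is 3 hours old
print(meets_rpo(now - timedelta(hours=5), rpo, now))  # backup is 5 hours old
```

Running a check like this on a schedule turns the RPO from a number in a document into an alert.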

RTO
Recovery Time Objective
Wider than RPO - Covers more than just data.
The maximum amount of time the system can remain unavailable without
significant impact to the business continuity.
Ideal value = 0

RTO and RPO
If it takes 2 hours to restore the last backup that was done 4 hours ago, then RTO
is >= 2 hours, and RPO is >= 4 hours.
If a master fails, and the slave is 10 minutes behind, your RPO cannot be < 10
minutes. If the application needs to be bounced to update the db connections
which takes 10 minutes, then the RTO cannot be < 10 minutes.
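Both examples above follow the same pattern: the age of the newest recoverable data bounds the RPO from below, and the time spent on recovery steps bounds the RTO from below. A minimal sketch, assuming the recovery steps run one after another (the function name is hypothetical):

```python
from datetime import timedelta
from typing import Iterable, Tuple


def recovery_floors(data_age: timedelta,
                    recovery_steps: Iterable[timedelta]) -> Tuple[timedelta, timedelta]:
    """Return (rpo_floor, rto_floor).

    data_age:       how stale the newest recoverable copy of the data is.
    recovery_steps: durations of the steps needed to serve traffic again,
                    assumed to run sequentially.
    """
    return data_age, sum(recovery_steps, timedelta())


# Backup taken 4 hours ago that takes 2 hours to restore.
print(recovery_floors(timedelta(hours=4), [timedelta(hours=2)]))

# Replica 10 minutes behind, plus a 10-minute application restart.
print(recovery_floors(timedelta(minutes=10), [timedelta(minutes=10)]))
```

These are floors, not estimates: detection time, decision time, and DNS or routing changes would all add to the real RTO.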

PTO*
Paid Time Off following the disaster recovery.
*It is more or less a convention to throw PTO in there.

Who decides RTO and RPO?
The business does.

That’s easy - get me zero RTO and RPO
Zero RTO and/or RPO is realistically impossible. (why?)
The business has to establish the tolerable RTO and RPO.
This acts as a requirements-spec for the DR Plan and Implementation.
These limits also help establish the SLA with customers.

Tolerable?
For a bank, an RPO greater than a few minutes = lost transactions.
For an online broker, an RTO greater than a few minutes = lost trades.
For a media company, RTO greater than a few minutes = angry tweets.
For a static website, weekly backups are acceptable with an RPO of 1 week.
For an HR system, RPO greater than a day may be acceptable, but RTO greater
than a few hours may not.

Common Failures
Network backbone/ISP Outage
Software Bugs
Storage Controller/NFS Crashes
Disruptive changes to security settings/firewalls
Corrupt DNS configuration being replicated
AWS/Public Cloud Outage

Hybrid Cloud
Most companies run a hybrid cloud, which means the infrastructure is split (usually
disproportionately) between on-prem and public cloud.

Backup & Restore
Regular backups are copied to the recovery site.
Infrastructure has to be spun up on the recovery site in the event of a disaster.
RPO and RTO can be in hours, if not days.
Inexpensive - Costs a few hundred dollars a month for the storage.

Pilot Light
Data is replicated asynchronously to the failover site
Infrastructure is provisioned, but needs to be started before taking any traffic
(RTO!)
Data replication may be a few seconds/minutes behind (RPO!)
Lower RTO and RPO than Backup & Restore, a bit more $$ for replication.

Warm Standby
Scaled down infrastructure is provisioned, running, ready to take on traffic.
May need to be scaled up to handle full production load (Autoscale!)
Data replication may be a few seconds/minutes behind (RPO!)
Lower RTO than Pilot Light, more $$ (why?)

Multi-Site
Multiple sites taking live production traffic
Difficult to pull off due to database constraints (multi-master, anyone?)
When done right, RPO and RTO of a few seconds to a few minutes
Costs an arm and a leg

Multi Cloud
Mother of them all.
Automation to support multiple cloud providers, plus on-prem.
RPO and RTO similar to multi-site, but provides isolation at a provider level.
Costs an arm, a leg, and a kidney.
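Since the business sets the tolerable RTO and RPO, picking among the strategies on the preceding slides can be framed as "the cheapest option that meets both". The sketch below is illustrative only: the RTO and RPO figures in the table are ASSUMED placeholders chosen to respect the ordering on the slides, not measurements, and should be replaced with your own numbers.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto: timedelta  # ASSUMED placeholder, not from the slides
    rpo: timedelta  # ASSUMED placeholder, not from the slides


# Ordered cheapest first, following the order of the slides.
STRATEGIES = [
    Strategy("Backup & Restore", rto=timedelta(hours=24), rpo=timedelta(hours=24)),
    Strategy("Pilot Light", rto=timedelta(hours=1), rpo=timedelta(minutes=10)),
    Strategy("Warm Standby", rto=timedelta(minutes=10), rpo=timedelta(minutes=10)),
    Strategy("Multi-Site", rto=timedelta(minutes=1), rpo=timedelta(minutes=1)),
]


def cheapest_strategy(max_rto: timedelta, max_rpo: timedelta) -> Optional[Strategy]:
    """Return the first (cheapest) strategy meeting both limits, or None."""
    for strategy in STRATEGIES:
        if strategy.rto <= max_rto and strategy.rpo <= max_rpo:
            return strategy
    return None


choice = cheapest_strategy(max_rto=timedelta(minutes=30), max_rpo=timedelta(minutes=30))
print(choice.name if choice else "No listed strategy meets these limits")
```

A `None` result is useful in itself: it signals that the requested limits need to be renegotiated with the business.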

Fail Back
Reverse the data flow
Freeze the DR site
Route traffic to primary site
Unfreeze the DR site

Survey the Land
Start with measuring your current RTO and RPO.

Gather data
You cannot improve what you cannot measure.
Bonus - Detect anomalies across the board.

Runbooks
Write them, and keep them updated.

Review your automation
Automate the infrastructure build-out with Infrastructure as Code (IaC)
Follow the Pull-request model for infrastructure changes.
Automating a destructive script (unintentionally) is the quickest way to a disaster.
foreach ($env == ‘prod’); sudo chmod -R -rx
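One common guard against a script like the one above is to default to a dry run and to require an explicit confirmation for protected environments. The sketch below is one possible shape for such a guard; `run_guarded`, `PROTECTED_ENVS`, and the confirmation convention are hypothetical names, not part of any particular tool.

```python
import subprocess
from typing import List, Optional

# Assumption: environments are identified by simple string names.
PROTECTED_ENVS = {"prod"}


def run_guarded(cmd: List[str], env: str,
                confirm: Optional[str] = None, dry_run: bool = True) -> str:
    """Run `cmd` only if the caller opted in explicitly.

    - For a protected environment, `confirm` must repeat the environment name.
    - Nothing executes unless dry_run is set to False.
    """
    if env in PROTECTED_ENVS and confirm != env:
        raise PermissionError(f"Refusing to touch '{env}' without confirm='{env}'")
    if dry_run:
        return "DRY RUN: " + " ".join(cmd)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return "EXECUTED: " + " ".join(cmd)


print(run_guarded(["echo", "hello"], env="staging"))
try:
    run_guarded(["echo", "hello"], env="prod")
except PermissionError as err:
    print(err)
```

The guard does not make a destructive command safe; it only makes it harder to run one by accident.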

Failure-as-a-service
Inject failures in the infrastructure.
Measure of readiness.
Chaos Engineering.
Netflix - Simian Army
Amazon Aurora Fault Injection Queries
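At the level of a single function call, failure injection can be as small as a decorator that makes a dependency fail some fraction of the time. This is a toy sketch, not how Simian Army or Aurora fault injection work; `inject_failures` and `InjectedFault` are hypothetical names.

```python
import random
from functools import wraps
from typing import Optional


class InjectedFault(RuntimeError):
    """Raised in place of the real call to simulate a failing dependency."""


def inject_failures(rate: float, rng: Optional[random.Random] = None):
    """Decorator: fail roughly `rate` (0.0 to 1.0) of calls with InjectedFault."""
    rng = rng or random.Random()

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if rng.random() < rate:
                raise InjectedFault(f"injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@inject_failures(rate=0.3, rng=random.Random(42))  # seeded so runs are repeatable
def fetch_profile(user_id: int) -> dict:
    return {"id": user_id}


failures = 0
for i in range(1000):
    try:
        fetch_profile(i)
    except InjectedFault:
        failures += 1
print(f"{failures} of 1000 calls failed")
```

What matters is not the injection itself but whether the callers, alerts, and runbooks behave as expected while it is happening.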

Not all components are equal - neither should their DRs

Blast Radius
A DNS failure can take down an entire data center.
A faulty switch can take down an entire subnet.
A service failure can take down all others dependent on it.
A Region failure has a larger blast radius than an Availability Zone failure.
A Provider failure has a larger blast radius than a Region failure.
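For service-to-service dependencies, the blast radius of a failure is the set of everything that depends on the failed component, directly or transitively. A minimal sketch over a hypothetical dependency map (the service names are made up for illustration):

```python
from collections import deque
from typing import Dict, Iterable, Set


def blast_radius(depends_on: Dict[str, Iterable[str]], failed: str) -> Set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: for each component, who depends on it?
    dependents: Dict[str, Set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed component.
    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != failed and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


# Hypothetical topology: web -> api -> (db, dns), batch -> db
deps = {"web": ["api"], "api": ["db", "dns"], "batch": ["db"], "db": [], "dns": []}
print(sorted(blast_radius(deps, "db")))
print(sorted(blast_radius(deps, "web")))
```

Components with the largest result are the ones whose DR plans deserve the most attention.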

Design for Fault Tolerance and Graceful Degradation
Prefer evented over synchronous processing wherever possible
Always assume failure
In the cloud, there are no edge cases
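Graceful degradation often comes down to deciding, per call, what a reduced but still useful answer looks like. The sketch below wraps a call with a fallback; `with_fallback` and the recommendation functions are hypothetical examples, not from the talk.

```python
from typing import Callable, Tuple, Type


def with_fallback(primary: Callable, fallback: Callable,
                  exceptions: Tuple[Type[BaseException], ...] = (Exception,)) -> Callable:
    """Return a callable that tries `primary` and degrades to `fallback` on failure."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except exceptions:
            # Degraded path: same arguments, cheaper or cached answer.
            return fallback(*args, **kwargs)
    return call


def personalized_recommendations(user_id: int) -> list:
    # Stand-in for a call to a service that is currently down.
    raise ConnectionError("recommendation service unavailable")


def popular_items(user_id: int) -> list:
    # Static answer that needs no remote dependency.
    return ["top-1", "top-2", "top-3"]


recommend = with_fallback(personalized_recommendations, popular_items,
                          exceptions=(ConnectionError,))
print(recommend(42))
```

Limiting `exceptions` to the failures you expect keeps genuine bugs from being silently hidden behind the fallback.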

Dashboards - Internal and External
Service health monitoring is critical..
..so is ensuring that the monitors themselves can survive a disaster.

Finally
Make disaster recovery and high availability a topic of discussion during every
stage of a project.
Ask the hard questions.
Embrace failure - learn from it.

Thank you!
Manish Pandit
Sr. Director of Engineering at Marqeta
@lobster1234
lobster1234.github.io
