Disaster Recovery Strategies with Config Management

DR Strategies with CM
Mandi Walls
CfgMgmtCamp
3 FEB 2014


whoami
• Mandi Walls
• Technical Practice Manager, CHEF
• mandi@getchef.com
• @lnxchk


What is Disaster Recovery

http://www.flickr.com/photos/61617934@N03/6196510705/sizes/z/in/photostream/

Reasons to Make DR Plans
• Your business insurance requires it
• Things are going to happen, whether you are ready or not


Tornado Events in Loudoun County, VA

http://www.tornadohistoryproject.com/tornado/Virginia/Loudoun/map


September 17, 2004, 3:55 pm

Everybody Else



Hurricane Sandy, NYC, October 2012

Photo: Iwan Baan and New York Magazine


Map labels:
• 111 8th
• 121 Varick
• 60 Hudson
• 65 Broadway
• 25 Broadway
• 375 Pearl
• 33 Whitehall
• 75 Broad
• My Apartment
• Bitches in BPC with newer infrastructure


Current State of DR
• Event horizon for modern DR was 9/11
  • Same neighborhood as Hurricane Sandy
• Most of the literature reflects the state of IT at that time


Goals of DR Planning
• Name staff and services that are key to business continuity
• Provide clear guidance for making decisions in real time
• Set rules for escalation, communication, participation
• Document all of these things, publish the results, keep them updated
on a regular basis


Advantages of CM when Planning DR
• Topology and service definition
• Settings and relationships
• Documentation
• Tooling and workflows


Old Rules that Still Apply
• Accessible off-site backups, with periodically tested restores
• Documentation should also be available if your normal services are
not
• Documents need to be updated on a regular schedule, and personnel
should be trained on their potential roles


New Rules

http://www.flickr.com/photos/26058810@N02/5650149188/sizes/z/in/photostream/

Rule 1: Your availability is your responsibility
• Cloud / managed hosting allows us to outsource a number of worries
  • Bandwidth, power, cooling
• That’s awesome, but does your vendor care as much about your
customers or users as you do?
• You must assess your tolerance for risk vs cost
• No longer entirely dependent on getting budget for full scale “DR sites”


Rule 1: To the Cloud!
• Justifying DR planning is much easier without justifying massive
quantities of capital for emergency capacity
• If your applications are not tightly coupled to custom services from your IaaS provider, your flexibility in outage events is increased
• Commonly missed items include:
  • Keeping passwords in a single location that may be inaccessible in outages
  • Not having accurate, current information about the operating systems or server capacities that will be needed, and how to translate them among providers
  • Not engaging with security and network teams to ensure all access is OK


Knife Plugins
$ knife rackspace server create (options)
$ knife linode server create (options)
$ knife ec2 server create (options)
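Because the three plugin commands above share the same shape, picking a fallback provider can be reduced to a lookup. The following is a minimal sketch, not part of any knife plugin: the helper names `server_create_command` and `first_available` are hypothetical, and `(options)` is left as the same placeholder the slide uses.

```ruby
# Providers whose knife plugins appear on the slide above.
PROVIDERS = %w[rackspace linode ec2].freeze

# Build the plugin command for one provider. The options string is passed
# through untouched; "(options)" is the slide's placeholder, not a real flag.
def server_create_command(provider, options = "(options)")
  raise ArgumentError, "unknown provider: #{provider}" unless PROVIDERS.include?(provider)

  "knife #{provider} server create #{options}"
end

# Hypothetical helper: return the first preferred provider that is not
# currently known to be unavailable (nil if none is left).
def first_available(preferred, unavailable)
  preferred.find { |provider| !unavailable.include?(provider) }
end

provider = first_available(%w[ec2 rackspace linode], ["ec2"])
puts server_create_command(provider)
```

The preference order and the "unavailable" list are things you would decide and maintain yourself; the sketch only shows that the same workflow applies across providers.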


Rule 2: Assessing realistic risk
• Do not bikeshed all possible events along all
potential space-time continua
• Assess risk based on affected services

http://badassoftheweek.com/godzilla.html

Rule 2: Planning for the Extent of an Event
• Service level
• Datacenter level
• Regional level
• National level


Service-Level and Datacenter-Level Events
• These are the easiest to deal with when you’re using CM!
• If your infrastructure is in code, move services to new blades of grass
by redeploying


Spiceweasel
• https://github.com/mattray/spiceweasel
• Define groups of infrastructure in Ruby, JSON, or YAML
• Spiceweasel will translate into knife commands to recreate the
running infrastructure


Spiceweasel
nodes:
- serverA:
    run_list: role[base]
    options: -i ~/.ssh/mray.pem -x user --sudo
- serverB serverC:
    run_list: role[base]
    options: -i ~/.ssh/mray.pem -x user --sudo -E production
- windows_winrm winboxA:
    run_list: role[base],role[iisserver]
    options: -x Administrator -P 'super_secret_password'
- windows_ssh winboxB winboxC:
    run_list: role[base],role[iisserver]
    options: -x Administrator -P 'super_secret_password'
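To illustrate the idea of translating a manifest into knife commands, here is a simplified sketch in plain Ruby. It is not Spiceweasel's implementation: it handles only the plain (non-Windows) entries, and the function name `bootstrap_commands` is hypothetical. It assumes each node entry expands to one `knife bootstrap` command per server name.

```ruby
require "yaml"

# A cut-down copy of the manifest above (non-Windows nodes only).
MANIFEST = <<~YAML
  nodes:
  - serverA:
      run_list: role[base]
      options: -i ~/.ssh/mray.pem -x user --sudo
  - serverB serverC:
      run_list: role[base]
      options: -i ~/.ssh/mray.pem -x user --sudo -E production
YAML

# Expand every node entry into one bootstrap command per server name.
def bootstrap_commands(manifest_yaml)
  manifest = YAML.safe_load(manifest_yaml)
  manifest.fetch("nodes").flat_map do |entry|
    entry.flat_map do |names, settings|
      names.split.map do |name|
        "knife bootstrap #{name} #{settings['options']} -r '#{settings['run_list']}'"
      end
    end
  end
end

puts bootstrap_commands(MANIFEST)
```

Keeping the manifest in version control next to your cookbooks means the commands can be regenerated anywhere, including from a laptop outside the affected datacenter.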

Regional Events
• Storms, volcanoes, large telecom cuts, worker strikes, etc.
• When regional civil infrastructure is affected
• May provide more warning - hurricanes may take several days to form
• Your staff may be without power or the ability to be physically present
in your office or datacenter
• Prioritization of services, training of backup staff


National Events
• Political unrest
• Other large natural disasters
• Decide if you even need a strategy for these cases
• If your service is down, but all of your customers are also offline, does it make
sense to pursue an extensive plan?


Kind of a Bummer

http://i.imgur.com/CH5J6Uz.jpg

Rule 3: Comprehensive plans require all players
• You may find yourself faced with an event in which your organization can provide only Minimum Viable Product-level services
• Scaling back services to only critical core components requires
decision making and planning by product, dev, ops, security, etc
• Minimize the need to also bring along extraneous services like VPNs
and specialized gear


Getting an MVP Up
Tiers:
• App LBs
• Cache
• App Servers
• DB Cache
• DB slaves
• DBs

Diagram annotations: Baseline Capacity (marked in two places), Maintain Interfaces?

Tackling a Reduced Topology
• Container for metadata related to the DR topology
• Chef environment, data bags for storing new info
• Separate from existing infrastructure metadata
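One way to picture such a metadata container is a single data bag item that lists the tiers of the reduced topology and how many nodes each needs at baseline. The sketch below is an assumption throughout: the item layout, role names, node naming scheme, and counts are invented for illustration, and `baseline_nodes` is a hypothetical helper.

```ruby
require "json"

# Hypothetical data bag item for the DR topology. Roles and baseline
# counts are made up for this example.
DR_TOPOLOGY = {
  "id" => "app1",
  "environment" => "dr-app1",
  "tiers" => [
    { "name" => "app-lb", "role" => "role[app_lb]", "baseline" => 1 },
    { "name" => "app",    "role" => "role[app]",    "baseline" => 2 },
    { "name" => "db",     "role" => "role[db]",     "baseline" => 1 }
  ]
}.freeze

# Expand each tier into the nodes that make up baseline capacity.
def baseline_nodes(topology)
  topology.fetch("tiers").flat_map do |tier|
    (1..tier.fetch("baseline")).map do |n|
      { "name" => "dr-#{tier['name']}-#{n}", "run_list" => tier.fetch("role") }
    end
  end
end

baseline_nodes(DR_TOPOLOGY).each { |node| puts "#{node['name']}  #{node['run_list']}" }

# The same item serialized as JSON, the format data bag items are stored in.
puts JSON.pretty_generate(DR_TOPOLOGY)
```

Because the item lives apart from the production metadata, it can be reviewed and changed by product, dev, ops, and security without touching the running infrastructure.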

http://www.flickr.com/photos/psd/9626226855/sizes/z/in/photostream/

DR Environment
• In Chef, an environment is a logical grouping for nodes
• Environments belonging to the same organization share other Chef
components like cookbooks and role definitions
• The environment allows you to customize settings for the nodes that
live in the environment


DR Environment
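$ cat environments/dr.rb
# Chef environment for the reduced DR topology
name "dr-app1"
description "DR for App1"
override_attributes(
  :app1 => {
    # "ro": presumably a read-only database connection while in DR
    :db_conn => "ro"
  }
)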
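The effect of the override can be pictured as a deep merge of the environment's attributes over a cookbook's defaults. This is a simplified model in plain Ruby, not Chef's actual attribute precedence code; the default values (`"rw"`, `pool_size`) are assumptions made up for the example.

```ruby
# Hypothetical cookbook defaults for app1.
DEFAULTS = { app1: { db_conn: "rw", pool_size: 10 } }.freeze

# Overrides taken from the "dr-app1" environment shown above.
DR_OVERRIDES = { app1: { db_conn: "ro" } }.freeze

# Recursively merge overrides into base; on conflict the override wins.
def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

effective = deep_merge(DEFAULTS, DR_OVERRIDES)
puts effective[:app1][:db_conn]   # prints "ro"
puts effective[:app1][:pool_size] # prints 10
```

Only the setting that differs in DR is stated in the environment; everything else is inherited, so the same cookbooks and roles serve both the normal and the DR topology.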

Rule 4: Prioritize
• Determine the hierarchy of all critical services
• Your list may have a different order depending on:
• Day of week / month / quarter - is accounting software P1 on the 10th of the
month?
• Length of outage - can a service be down a short time with fewer risks?
• Amount of time necessary to recover - how long will it take your data analytics
system to catch up after an outage of N hours? More than N additional hours?
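The catch-up question in the last bullet can be estimated with simple arithmetic. The sketch below assumes data arrives at a constant rate during and after the outage, and that the recovered system processes at a constant rate; the function name and the example numbers are hypothetical.

```ruby
# Hours needed to clear the backlog built up during an outage.
# backlog = outage_hours * arrival_rate
# After recovery the backlog drains at (processing_rate - arrival_rate).
def catch_up_hours(outage_hours:, arrival_rate:, processing_rate:)
  headroom = processing_rate - arrival_rate
  return Float::INFINITY if headroom <= 0 # the backlog never shrinks

  outage_hours * arrival_rate.to_f / headroom
end

# 4-hour outage, 100 units/hour arriving, 150 units/hour processed.
puts catch_up_hours(outage_hours: 4, arrival_rate: 100, processing_rate: 150) # prints 8.0

# Same outage with much more spare capacity.
puts catch_up_hours(outage_hours: 4, arrival_rate: 100, processing_rate: 300) # prints 2.0
```

Under these assumptions, catch-up takes longer than the outage itself whenever processing capacity is less than twice the arrival rate, which is one reason a slow-to-recover system may deserve a higher priority than its day-to-day importance suggests.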


User Behavior
[Chart: App 1 and App1 Avg; y-axis 0 to 150; x-axis time of day, 0600 through 0600 the following day]

Managing Complexity
• Your CM tool is composed of atomic units representing your
infrastructure
• Rely on those to help you manage the additional complexity of
instantiating new resources in emergencies
• All relationships should be well defined and encoded in the CM tools
• Eliminate the need for specialized knowledge for your DR planning


Rule 5: Don’t plan for heroism
• When catastrophic events occur, safety of your people is primary
• Large events affect the availability of people resources
• If your staff has reason to be concerned for their welfare, or the
welfare of their families, those are priorities


DR for People
• Resist the urge to hide your config management from different teams
• You can’t predict which members of your team will be able to help


Checklist
• Identify providers to be used in the case of an outage
  • Are you going to use AWS? Use idle or underutilized infrastructure in other locations? Will there be DNS changes, etc.?

• Make sure all accounts, billing, and personnel access are up to date
• Check this on a regular basis. Add new staff to access lists promptly.

• All new service deployments must include an emergency plan
• Plan for your primary folks to be unavailable


TL;DR
• Start with baseline
• Add components over time
• Rebuild and return to initial infrastructure if / when possible


Other Stuff to Take into Consideration
• SaaS solutions for temporary infrastructures
• Monitoring and metrics, CDNs, code repositories
• Also for backoffice: email services, document storage

• Often scary for security and compliance folks
• Speed time to recovery in large-loss events


fin
• Time to rewrite DR practices for a new generation of tools and services
• Send me your stories if you can share: mandi@getchef.com

http://i.imgur.com/KdRnwZK.jpg
