Disaster Recovery Strategies with Config Management

Presented at CfgMgmtCamp, Ghent, BE. 3 FEB 2014.

Transcript

  • 1. DR Strategies with CM Mandi Walls CfgMgmtCamp 3 FEB 2014 Monday, February 3, 14
  • 2. whoami • Mandi Walls • Technical Practice Manager, CHEF • mandi@getchef.com • @lnxchk
  • 3. What is Disaster Recovery http://www.flickr.com/photos/61617934@N03/6196510705/sizes/z/in/photostream/
  • 4. Reasons to Make DR Plans • Your business insurance requires it • Things are going to happen, whether you are ready or not
  • 5. Tornado Events in Loudoun County, VA http://www.tornadohistoryproject.com/tornado/Virginia/Loudoun/map
  • 6. Tornado Events in Loudoun County, VA September 17, 2004 3:55 pm http://www.tornadohistoryproject.com/tornado/Virginia/Loudoun/map
  • 9. Tornado Events in Loudoun County, VA September 17, 2004 3:55 pm Everybody Else http://www.tornadohistoryproject.com/tornado/Virginia/Loudoun/map
  • 10. Hurricane Sandy, NYC, October 2012 Photo: Iwan Baan and New York Magazine
  • 11. Hurricane Sandy, NYC, October 2012 33 Whitehall Photo: Iwan Baan and New York Magazine
  • 12. Hurricane Sandy, NYC, October 2012 60 Hudson 33 Whitehall Photo: Iwan Baan and New York Magazine
  • 13. Hurricane Sandy, NYC, October 2012 375 Pearl 60 Hudson 33 Whitehall Photo: Iwan Baan and New York Magazine
  • 14. Hurricane Sandy, NYC, October 2012 375 Pearl 60 Hudson 65 Broadway 33 Whitehall Photo: Iwan Baan and New York Magazine
  • 15. Hurricane Sandy, NYC, October 2012 375 Pearl 60 Hudson 65 Broadway 33 Whitehall 25 Broadway Photo: Iwan Baan and New York Magazine
  • 16. Hurricane Sandy, NYC, October 2012 111 8th 60 Hudson 65 Broadway 375 Pearl 33 Whitehall 25 Broadway Photo: Iwan Baan and New York Magazine
  • 17. Hurricane Sandy, NYC, October 2012 111 8th 60 Hudson 65 Broadway 25 Broadway 375 Pearl 33 Whitehall 75 Broad Photo: Iwan Baan and New York Magazine
  • 18. Hurricane Sandy, NYC, October 2012 111 8th 121 Varick 60 Hudson 65 Broadway 25 Broadway 375 Pearl 33 Whitehall 75 Broad Photo: Iwan Baan and New York Magazine
  • 19. Hurricane Sandy, NYC, October 2012 111 8th 121 Varick 60 Hudson 65 Broadway 25 Broadway 375 Pearl 33 Whitehall 75 Broad My Apartment Photo: Iwan Baan and New York Magazine
  • 20. Hurricane Sandy, NYC, October 2012 111 8th 121 Varick 60 Hudson 65 Broadway 25 Broadway Bitches in BPC with newer infrastructure 375 Pearl 33 Whitehall 75 Broad My Apartment Photo: Iwan Baan and New York Magazine
  • 21. Current State of DR • Event horizon for modern DR was 9/11 • Same neighborhood as Hurricane Sandy • Most of the literature reflects the state of IT at that time
  • 22. Goals of DR Planning • Name staff and services that are key to business continuity • Provide clear guidance for making decisions in real time • Set rules for escalation, communication, participation • Document all of these things, publish the results, keep them updated on a regular basis
  • 23. Advantages of CM when Planning DR • Topology and service definition • Settings and relationships • Documentation • Tooling and workflows
  • 24. Old Rules that Still Apply • Accessible off-site backups, with periodically tested restores • Documentation should also be available if your normal services are not • Documents need to be updated on a regular schedule, and personnel should be trained on their potential roles
  • 25. New Rules http://www.flickr.com/photos/26058810@N02/5650149188/sizes/z/in/photostream/
  • 26. Rule 1: Your availability is your responsibility • Cloud / managed hosting allows us to outsource a number of worries • Bandwidth, power, cooling • That’s awesome, but does your vendor care as much about your customers or users as you do? • You must assess your tolerance for risk vs cost • No longer entirely dependent on getting budget for full-scale “DR sites”
  • 27. Rule 1: To the Cloud! • Justifying DR planning is much easier without justifying massive quantities of capital for emergency capacity • If your applications are not tightly coupled to custom services from your IaaS provider, your flexibility in outage events is increased • Commonly missed items include • Keeping passwords in a single location that may be inaccessible in outages • Not having accurate information about the operating systems or server capacities that will be needed, and how to translate among providers • Not engaging with security and network teams to ensure all access is ok
  • 28. Knife Plugins
      $ knife rackspace server create (options)
      $ knife linode server create (options)
      $ knife ec2 server create (options)
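Because the three plugin commands on this slide share one shape, a fallback between providers can be sketched in a few lines of plain Ruby. This is only an illustration: `knife_server_create` and `first_available` are hypothetical helper names, the ordering of providers is an assumption, and provider-specific options are passed through untouched because the slide elides them as "(options)".

```ruby
# Providers whose knife plugins are named on the slide.
KNIFE_PROVIDERS = %w[rackspace linode ec2].freeze

# Hypothetical helper: build the "knife <provider> server create" command line.
# Any options are appended as given; none are invented here.
def knife_server_create(provider, options = [])
  unless KNIFE_PROVIDERS.include?(provider)
    raise ArgumentError, "no knife plugin listed for #{provider}"
  end
  (["knife", provider, "server", "create"] + options).join(" ")
end

# Hypothetical helper: pick the first provider that is not currently down.
def first_available(providers, unavailable)
  providers.find { |provider| !unavailable.include?(provider) }
end

# Assumed preference order, with ec2 marked as unavailable for the example.
target = first_available(%w[ec2 rackspace linode], ["ec2"])
puts knife_server_create(target)
```

The point of the sketch is that the decision about where to rebuild is data, while the command that does the rebuilding stays the same apart from the provider name.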
  • 29. Rule 2: Assessing realistic risk • Do not bikeshed all possible events along all potential space-time continua • Assess risk based on affected services http://badassoftheweek.com/godzilla.html
  • 30. Rule 2: Planning for the Extent of an Event • Service level • Datacenter level • Regional level • National level
  • 31. Service-Level and Datacenter-Level Events • These are the easiest to deal with when you’re using CM! • If your infrastructure is in code, move services to new blades of grass by redeploying
  • 32. Spiceweasel • https://github.com/mattray/spiceweasel • Define groups of infrastructure in Ruby, JSON, or YAML • Spiceweasel will translate into knife commands to recreate the running infrastructure
  • 33. Spiceweasel
      nodes:
      - serverA:
          run_list: role[base]
          options: -i ~/.ssh/mray.pem -x user --sudo
      - serverB serverC:
          run_list: role[base]
          options: -i ~/.ssh/mray.pem -x user --sudo -E production
      - windows_winrm winboxA:
          run_list: role[base],role[iisserver]
          options: -x Administrator -P 'super_secret_password'
      - windows_ssh winboxB winboxC:
          run_list: role[base],role[iisserver]
          options: -x Administrator -P 'super_secret_password'
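To make the idea of "manifest in, knife commands out" concrete, here is a much-simplified sketch in plain Ruby. It is not Spiceweasel's implementation and its output format is only an approximation: `bootstrap_commands` is a hypothetical function, the manifest is written as a Ruby hash holding just the two Linux entries from the slide, and Windows nodes, cloud providers, cookbooks, and data bags are all ignored.

```ruby
# The first two node entries from the slide, expressed as a Ruby hash.
MANIFEST = {
  "nodes" => [
    { "serverA" => { "run_list" => "role[base]",
                     "options"  => "-i ~/.ssh/mray.pem -x user --sudo" } },
    { "serverB serverC" => { "run_list" => "role[base]",
                             "options"  => "-i ~/.ssh/mray.pem -x user --sudo -E production" } }
  ]
}.freeze

# Hypothetical translator: one "knife bootstrap" line per host name.
# A key such as "serverB serverC" expands to one command per host.
def bootstrap_commands(manifest)
  manifest.fetch("nodes").flat_map do |entry|
    entry.flat_map do |names, settings|
      names.split.map do |host|
        "knife bootstrap #{host} #{settings['options']} -r '#{settings['run_list']}'"
      end
    end
  end
end

puts bootstrap_commands(MANIFEST)
```

Keeping the manifest in version control next to the cookbooks means the list of machines to recreate is reviewed and updated the same way as the rest of the infrastructure code.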
  • 34. Regional Events • Storms, volcanoes, large telecom cuts, worker strikes, etc. • When regional civil infrastructure is affected • May provide more warning - hurricanes may take several days to form • Your staff may be without power or the ability to be physically present in your office or datacenter • Prioritization of services, training of backup staff
  • 35. National Events • Political unrest • Other large natural disasters • Decide if you even need a strategy for these cases • If your service is down, but all of your customers are also offline, does it make sense to pursue an extensive plan?
  • 36. Kind of a Bummer http://i.imgur.com/CH5J6Uz.jpg
  • 37. Rule 3: Comprehensive plans require all players • You may find yourself faced with an event in which your organization is able to only provide Minimum Viable Product-level services • Scaling back services to only critical core components requires decision making and planning by product, dev, ops, security, etc. • Minimize the need to also bring along extraneous services like VPNs and specialized gear
  • 38. Getting an MVP Up [Diagram: App LBs, Cache, App Servers, DB Cache, DB slaves, DBs]
  • 39. Getting an MVP Up [Diagram: App LBs, Cache, App Servers, DB Cache, DB slaves, DBs; two tiers labeled “Baseline Capacity”]
  • 40. Getting an MVP Up [Diagram: App LBs, Cache, App Servers, DB Cache, DB slaves, DBs; two tiers labeled “Baseline Capacity”, one labeled “Maintain Interfaces?”]
  • 41. Tackling a Reduced Topology • Container for metadata related to the DR topology • Chef environment, data bags for storing new info • Separate from existing infrastructure metadata http://www.flickr.com/photos/psd/9626226855/sizes/z/in/photostream/
  • 42. DR Environment • In Chef, an environment is a logical grouping for nodes • Environments belonging to the same organization share other Chef components like cookbooks and role definitions • The environment allows you to customize settings for the nodes that live in the environment
  • 43. DR Environment
      $ cat environments/dr.rb
      name "dr-app1"
      description "DR for App1"
      override_attributes(
        :app1 => { :db_conn => "ro" }
      )
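The effect of the environment file can be modeled outside Chef as a deep merge in which the environment's override wins over a cookbook default. The sketch below is a simplified stand-in, not Chef's actual precedence engine, which has more levels: `deep_merge` is a hypothetical helper, and the default values (`"rw"`, `pool_size`) are assumptions made up for the example. Only `:app1` / `:db_conn => "ro"` comes from the slide.

```ruby
# Hypothetical helper: merge nested hashes, letting the override win on conflict.
def deep_merge(base, override)
  base.merge(override) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Assumed cookbook defaults for app1 (not from the talk).
cookbook_defaults = { app1: { db_conn: "rw", pool_size: 10 } }

# Mirrors override_attributes in environments/dr.rb.
dr_overrides = { app1: { db_conn: "ro" } }

effective = deep_merge(cookbook_defaults, dr_overrides)
puts effective[:app1][:db_conn]    # value supplied by the DR environment
puts effective[:app1][:pool_size]  # default left untouched
```

A node placed in the DR environment therefore picks up the read-only database setting without any change to the cookbook itself, which is what keeps the DR metadata separate from the normal infrastructure metadata.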
  • 44. Rule 4: Prioritize • Determine the hierarchy of all critical services • Your list may have a different order depending on: • Day of week / month / quarter - is accounting software P1 on the 10th of the month? • Length of outage - can a service be down a short time with fewer risks? • Amount of time necessary to recover - how long will it take your data analytics system to catch up after an outage of N hours? More than N additional hours?
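A priority list that changes with the calendar and the length of the outage can be written down as a small function rather than a static document. The following is a sketch under invented assumptions: the service names, base priorities, and tolerable outage hours are hypothetical, and the two rules (accounting becomes P1 on the 10th; a service past its tolerable outage moves up one level) are just one possible reading of the questions on the slide.

```ruby
require "date"

# Hypothetical service catalogue: a lower number means a higher priority.
Service = Struct.new(:name, :base_priority, :tolerable_outage_hours)

SERVICES = [
  Service.new("storefront", 1, 0),
  Service.new("analytics",  3, 8),
  Service.new("accounting", 4, 24)
].freeze

# Assumed rules: accounting is P1 on the 10th of the month, and any service
# that has been down longer than it can tolerate is bumped up one level.
def priority_for(service, date, outage_hours)
  priority = service.base_priority
  priority = 1 if service.name == "accounting" && date.day == 10
  priority -= 1 if outage_hours > service.tolerable_outage_hours && priority > 1
  priority
end

# Order services for recovery; ties are broken alphabetically by name.
def recovery_order(services, date, outage_hours)
  services.sort_by { |s| [priority_for(s, date, outage_hours), s.name] }.map(&:name)
end

puts recovery_order(SERVICES, Date.new(2014, 2, 3), 2).inspect
puts recovery_order(SERVICES, Date.new(2014, 2, 10), 2).inspect
```

Encoding the rules this way forces product, dev, and ops to agree on them ahead of time, and the result can be regenerated on the day of an event instead of being argued out in real time.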
  • 45. User Behavior [Chart: series “App 1” and “App1 Avg”; y-axis 0 to 150; x-axis time of day from 0600 to 0600 the next day]
  • 46. Managing Complexity • Your CM tool is composed of atomic units representing your infrastructure • Rely on those to help you manage the additional complexity of instantiating new resources in emergencies • All relationships should be well defined and encoded in the CM tools • Eliminate the need for specialized knowledge for your DR planning
  • 47. Rule 5: Don’t plan for heroism • When catastrophic events occur, the safety of your people comes first • Large events affect the availability of people resources • If your staff has reason to be concerned for their welfare, or the welfare of their families, those are priorities
  • 48. DR for People • Resist the urge to hide your config management from different teams • You can’t predict which members of your team will be able to help
  • 49. Checklist • Identify providers to be used in the case of an outage • Are you going to use AWS? Use idle or underutilized infrastructure in other locations? Will there be DNS changes, etc.? • Make sure all accounts, billing, and personnel access are up to date • Check this on a regular basis. Add new staff to access lists promptly. • All new service deployments must include an emergency plan • Plan for your primary folks to be unavailable
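The "check access on a regular basis" item lends itself to a scheduled comparison between the current staff list and each provider's access list. Here is a minimal sketch with made-up data: `access_gaps` is a hypothetical function, the user and provider names are placeholders, and in practice both lists would have to be pulled from an HR roster and from each provider's account console or API.

```ruby
# Hypothetical audit: for each provider, report staff who lack access
# (missing) and accounts that no longer match anyone on staff (stale).
def access_gaps(on_call_staff, provider_access)
  provider_access.each_with_object({}) do |(provider, users), gaps|
    missing = on_call_staff - users
    stale   = users - on_call_staff
    gaps[provider] = { missing: missing, stale: stale } unless missing.empty? && stale.empty?
  end
end

# Placeholder data for the example.
staff  = %w[alice bob carol]
access = {
  "primary-iaas" => %w[alice bob carol],
  "dr-iaas"      => %w[alice dave]
}

access_gaps(staff, access).each do |provider, gap|
  puts "#{provider}: add #{gap[:missing].join(', ')}; review #{gap[:stale].join(', ')}"
end
```

Running a check like this on a schedule catches the common case where the DR provider account was set up once and never updated as the team changed.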
  • 50. TL;DR • Start with baseline • Add components over time • Rebuild and return to initial infrastructure if / when possible
  • 61. Other Stuff to Take into Consideration • SaaS solutions for temporary infrastructures • Monitoring and metrics, CDNs, code repositories • Also for backoffice: email services, document storage • Often scary for security and compliance folks • Speed time to recovery in large-loss events
  • 62. fin • Time to rewrite DR practices for a new generation of tools and services • Send me your stories if you can share mandi@getchef.com http://i.imgur.com/KdRnwZK.jpg