Real World DR Testing
Ben Acord
Why?
Connectivity
Performance
The Usual Suspects - Roles and Responsibilities
Bookend Every Test
Review Plan
Test Plan
Update Plan
[1] READ-THROUGH Test
[2] STRUCTURED WALK-THROUGH Test
Find Gaps
[3] SIMULATION Test
[4] PARALLEL Test
Copyright © 2017 Ben Acord. All rights reserved.
[5] FULL - INTERRUPTION Test
Celebrate !
Post-Mortem Lessons Learned
Shred Old Plans
(1) Read-Through / Checklist
Most important. Key personnel
separately review current plan.
(2) Walk-Through
Tabletop exercise of key
personnel to role play a
critical scenario.
(3) Simulation
DR team R&D. Benchmark.
Possible minor interruptions.
(4) Parallel
No interruptions. All
activity at alternate site.
(5) Full-Interruption
Recover select business operations
to alternate recovery site.
DR Testing Quick Reference
THANKS!
Any questions?

2017 07 - Real World DR Testing

Editor's Notes

  • #2 Lessons learned from each type of testing in relation to disasters.
  • #3 Business exists to benefit the customers, owners, and employees. Testing raises the bar for the type of disaster which could end the business. There are legal reasons as well. The Sarbanes-Oxley Act of 2002 put DR on the agenda for many executive board meetings throughout the United States. This was one of the major drivers for my own annual DR testing at SunGard in Philadelphia throughout the 2000s. Feeling a bit old, but those were good days; stressful, but no other business practice honed my technical and business acumen as much as DR testing. At some point maybe all IT systems will reach the promised land of microservice resilience and seamless DevOps workflow. But for now, in many rack units of our data centers, we have systems and applications which require a human touch for recovery. << STORY: Company running production from SunGard >> NIST SP 800-34 rev 1: http://csrc.nist.gov/publications/PubsSPs.html U.S. Department of the Interior listing: https://www.doi.gov/recovery/about-recovery/disaster-laws OSHA also has guidance: https://www.osha.gov/pls/oshaweb/owadisp.show_document?p_id=9726&p_table=standards
  • #4 The Disaster Recovery team is a collection of several key roles in an organization. By designating the roles and responsibilities in the DR team, each person is able to focus on their respective tasks without unnecessary distraction. It’s difficult to discuss ideal conditions when disaster events will do everything to throw roles out the door. Testing has a key advantage over an actual disaster in its communication matrix, often called a RACI for Responsible, Accountable, Consulted, and Informed. This will pay off immensely for everyone involved in a real event, too. Management should interface directly with, and receive regular status updates from, one or more coordinators. Likewise, those responsible for technical recovery tasks should keep the coordinator regularly updated. The toughest role is that of the coordinator. These folks need to be tough and empowered to push back in both directions. They need to buffer against micromanagement from the top and a lack of communication from those at the alternate site. The coordinator facilitates all of these tests and the scenarios involved.
  • #5 Regardless of the type of test performed the workflow should be the same. The DR team reviews the plan prior to a scheduled test event, they perform the test, and finally, they update the plan with lessons learned. This simple continuous improvement bookends every test we’ll look at today.
  • #6 Also called a checklist test, this is a way of ticking off the required tasks. Each reviewer takes a copy of the DR plan and verifies the content. They mark up sections needing removal or updating, ideally also indicating who should make the edits and by when. If a topic is stated at a high level and does not have a corresponding detailed, step-by-step task, it should be flagged for revision. While team leads are often those selected to perform this test, it is best to remember that an SME may not be the one performing the recovery task. Is the task documented such that a junior sysadmin or network admin could follow the plan? The best SMEs don’t fear sharing their brainery. As with all tests, there must be measurable metrics gathered to determine success or failure. In the case of a read-through test we want to set deadlines for document review. We’re building up to bigger and much more involved tests, all of which hinge on this first review. So here we have IT Operations managers, the DR testing team, and coordinators. << STORY: Mainframe binder >> Checkbox journal Photo by Glenn Carstens-Peters on Unsplash
  • #7 If any of you are familiar with tabletop gaming such as Dungeons and Dragons, Pathfinder, or Settlers of Catan, this is very similar. The Coordinator acts as a game master of sorts by putting together a believable scenario for the team to practice. At this level of testing the scenario is usually singular. The goal is to work out the kinks of communication and coordination of activities. Each of us might look at that image and think this is light and fluffy. But if handled correctly it can reveal some major gaps. << STORY: The best structured walk-through of my career >>
  • #8 Hurricane Katrina showed that alternate facilities may be affected to a similar or worse degree as the primary. As you read through the phases of recovery with your DR team think critically about assumptions. What are we assuming will be available? Is that reasonable in the given scenario?
  • #9 Not all simulations are created equal. If the business impact analysis has actually revealed that operations are threatened by large reptiles, by all means, suit up. For the rest of us the scenarios will directly relate to interruptions with our supply chain, financial processing, employee availability, and other use cases. Resist the urge to water down the simulation or turn it into play time for an easy gold star. It doesn’t take much of a disaster to reveal our true level of preparedness. This is the first test which could interrupt a business operation. Usually a simulation test is limited in scope to a single business solution and its respective systems. The coordinator presents a scenario for the DR team to research and develop a solution for availability or resilience. The big takeaway here is that benchmarking is crucial. The more test cycles run, the more accurate the range of benchmarks. Simulations are the best way to decrease recovery time objective (RTO). If your benchmarks indicate times for each phase of recovery, your team will be able to R&D possible improvements. Results from all tests roll up into lessons learned, documentation updates to the DR plan, and potentially the technical processes for backing up data. A best practice is to provide some form of lab or in-house data center isolation to test these scenarios. Give your DR team flexibility and structure for when and how to refine the backup and recovery processes. << STORY: SME Hit by the Proverbial Beer/Milk Truck >> Photo: flickr.com/photos/89330362@N03/8134220969
  • #10 Business as usual for most of the company. But not so for the DR team. The DR team heads to the alternate site (physically or virtually depending upon the plan) and the coordinator opens the conference bridge. All recovery activity takes place at the alternate site and systems with no connection back to the production organization. The goal is to recover all critical business systems within RPO and RTO parameters. In other words, the DR team is recovering the data needed by the business within the time limit. It’s not uncommon to have several business users remotely connect to recovered systems at the parallel site and perform select operations. All data entered into a parallel system is considered dummy data and will be scrubbed upon test completion. Does your data destruction and wipe policy or procedures cover test sites? Photo: StockSnap_M7OOW736UB
  • #11 Taking notes during a test saves your sanity later. Even quick comments such as “archive recovery logs X - Y were corrupt” will help when you get back to the office and prepare for the lessons learned. After the parallel test concludes, the DR team notifies the coordinator and conference bridge of the final status. The coordinator concludes the event. A full status report will not be presented to the board until after the team returns and conducts a lessons learned review, which we’ll cover in a bit. << STORY: Philly >>
  • #12 Prime time: the most encompassing and involved test possible. This may minimally affect normal business operations and personnel. The company will run business operations from the alternate site and systems for a defined period of time, then fail back to official production systems. Typically, there are processes for backfilling data. Very few companies perform a full interruption test. Most stop at a parallel test, leaving the full interruption for a real emergency. Photo: StockSnap_QIXFALAOQ4
  • #13 Particularly with Parallel or Full-Interruption tests, a celebration is well in order. More than likely most, if not all, of the team have been working for the full recovery window. The DR testing team has undergone a stress-filled, time-compressed test of their individual competencies and the robustness of the company’s documentation, and has emerged from the gauntlet honed and improved. The DR testing team should decompress, expense an amazing meal, and get some sleep.
  • #14 There are several days between notifying the coordinator of the final success or failure and the lessons learned meeting. Be brutally honest when it comes to lessons learned. It may seem counterintuitive, but the best job security comes from sharing. Come to the meeting or meetings prepared with facts and root cause analysis from your testing experience. No test is complete until the plan documentation is updated with the latest intel. At this point even a first-time DR test team has significantly increased its core competencies. They have a deeper, more intimate understanding of the complexity that is the business. The more often you test, the better prepared you will be for a real disaster, which will often be a swirling maelstrom of scenarios and chaos. Photo: StockSnap_YR89OQFMT1
  • #15 The last action of the coordinator and DR team is to collect and destroy any print copies of the DR plan. These hard copies do not reflect the new, current state of the DR plan and could create confusion or impede the recovery efforts. Photo: commons.wikimedia.org/wiki/File:Shredded.jpg
  • #16 A cheat sheet of sorts for quick reference. (1) Read-Through: can be done frequently throughout the year; low cost and time commitment. (2) Walk-Through: scenarios affecting critical operations as defined in the business impact analysis (BIA). (3) Simulation: the biggest contributor to recovery success in numbers 4, 5, and real recovery events. (4) Parallel: puts results of #3 to work in a mock recovery; typically an annual event. (5) Full-Interruption: rare, even in mature shops; backflow of data is often an issue.
  • #17 Presentation template by SlidesCarnival Photographs by the author, Unsplash, and Stocksnap.io
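
The benchmarking described in the notes for slide #9, and the RTO and RPO limits mentioned for slide #10, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's tooling: the phase names, timings, cycle labels, and the `DrillCycle`, `phase_benchmarks`, `meets_rto`, and `meets_rpo` names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class DrillCycle:
    """One DR test run: minutes spent in each recovery phase (hypothetical phases)."""
    label: str
    phase_minutes: dict[str, float]

    def total(self) -> timedelta:
        # End-to-end recovery time is the sum of all phases.
        return timedelta(minutes=sum(self.phase_minutes.values()))


def phase_benchmarks(cycles):
    """Return {phase: (min, mean, max)} in minutes across all cycles."""
    result = {}
    for phase in cycles[0].phase_minutes:
        samples = [c.phase_minutes[phase] for c in cycles]
        result[phase] = (min(samples), mean(samples), max(samples))
    return result


def meets_rto(cycle, rto):
    """True if the whole recovery finished inside the recovery time objective."""
    return cycle.total() <= rto


def meets_rpo(last_good_backup, outage_start, rpo):
    """Data written after the last good backup is lost; that gap must fit in the RPO."""
    return outage_start - last_good_backup <= rpo


# Illustrative numbers only, not measurements from any real test.
cycles = [
    DrillCycle("cycle-1", {"restore_os": 90, "restore_data": 240, "validate_apps": 60}),
    DrillCycle("cycle-2", {"restore_os": 75, "restore_data": 210, "validate_apps": 45}),
]
rto = timedelta(hours=6)
rpo = timedelta(hours=24)

for phase, (lo, avg, hi) in phase_benchmarks(cycles).items():
    print(f"{phase}: min={lo} mean={avg} max={hi} minutes")
for c in cycles:
    status = "within RTO" if meets_rto(c, rto) else "exceeds RTO"
    print(c.label, "total", c.total(), status)
```

Recording times per phase rather than only a total is what lets the team see where the recovery window is actually spent, and each additional test cycle tightens the min and max range for every phase.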