Best practices in DR management and testing


Published in: Technology

This research note is restricted to the personal use of Aristotle Castro.

Best Practices for Planning and Managing Disaster Recovery Testing

16 August 2011 | ID:G00215785
John P Morency

Annual costs for disaster recovery testing can be as high as $150,000. Solutions for discovering and mapping software and data dependencies among Web-based applications are likely to become essential for DR testing/exercising, as part of an organization's best practices.

Overview

The time and resource costs of disaster recovery (DR) plan exercising, especially exercising that is supported by manual or semimanual processes, have become the most significant IT DR management (IT-DRM) pain point for many of Gartner's clients. Specific steps can be taken and technologies can be deployed to reduce recovery plan testing costs and complexities.

Key Findings

  • The annual costs of DR testing can reach or exceed $150,000 for many Gartner clients. These costs could go even higher as new business applications are rolled into production.
  • Tools capable of discovering and mapping software and data dependencies between Web-based applications are likely to become essential for managing efficient and effective recovery testing/exercising.
  • The need for more thorough business application inquiry and transaction testing will drive enterprises to assess organizational and test management consolidation and integration to more efficiently scale recovery testing in the future.

Recommendations

  • Evaluate IT service dependency mapping technologies from vendors such as BMC Software (Tideway), CA Technologies, HP, IBM, Neebula, ServiceNow and VMware to assess the extent to which they can simplify the testing process and make it more reliable.
  • Pilot software change management tools (from vendors such as BMC Software, CA Technologies, HP, IBM Maximo, SAP and ServiceNow) and procedures that have the potential to most effectively synchronize change implementation between primary production and secondary recovery data centers.
  • Evaluate the possible savings that can be gained by consolidating the application testing resources, processes and tools used by the DR and quality assurance (QA) testing teams.

Analysis

DR testing is critical for supporting business resiliency. However, as the scope of mission-critical business processes, applications and data increases, sustaining the quality and thoroughness of the test process can be a challenge. Gartner client recovery and continuity-specific inquiries indicate that many enterprises are now implementing new approaches for managing recovery exercising, mostly because of the increasing cost and logistical complexity of traditional approaches.

Gartner research shows the importance of effectively managing recovery exercising costs. In one study of the exercising costs of federal government agencies (see "Cost-Cutting IT: Should You Cut Back Your Disaster Recovery Exercise Spending?"), clients reported that IT-DRM annual exercise budget allocations ranged from $20,000 to more than $150,000, depending on the size, location, number of participants, scope of exercise and organizational structure of the governmental unit. Results from nongovernment client inquiries have shown that it isn't unusual for the annual cost of DR exercising to be between $75,000 and $150,000.

Gartner has identified some of the key reasons enterprises find DR testing increasingly difficult and/or costly:

  • Increasingly complex dependencies: Web applications and services often have logically meshed relationships with, and dependencies on, other applications and data, some of which are often part of a lower recovery tier (see Table 1).
  • Inconsistencies: These occur between the current state of the data center infrastructure, applications and data, and their state at the time of the last recovery test. This may affect the extent to which production applications and data can be successfully recovered, unless robust change and configuration management processes (and tools) are in place.
    For example, a monthly volume of even a few hundred changes to a data center's OS, middleware, applications or management agents can accumulate, over the months between tests, into a difference of thousands of changes between the current production configuration and the production configuration at the time of the last recovery test.
  • Lack of resources: With the increasingly complex scope of testing, enterprises rarely have adequate recovery testing resources to exercise all production application inquiries and transactions on a regular basis. Some organizations test only their most mission-critical applications. Others rotate testing among applications, while still others focus on systems that have failed previous tests. A frequent result is that lower-priority applications are tested far less frequently, and their recoverability is qualified as being "… on a best effort basis."

Table 1. Recovery Tiers
(RTO = recovery time objective; RPO = recovery point objective)

Tier  Scheduled  Availability                            RTO                   RPO
1     24/7       99.9% (less than 45 minutes/month)      two to eight hours    four hours
2     24/6 3/4   99.5% (less than 3.5 hours per month)   eight to 24 hours     four hours
3     18/7       99% (less than 5.5 hours per month)     one to three days     one day
4     24/6 1/2   98% (less than 13.5 hours per month)    more than three days  one day

Source: Gartner (August 2011)

In light of these challenges, Gartner is increasingly seeing clients rethink their test strategies and implement a series of best practices.

Establishing a Minimum Acceptable Level of Recovery Testing

The 2011 Gartner Risk Management Survey shows that enterprises test recoverability, on average, once or twice a year. However, anecdotal evidence (based on more than 3,000 DR-related Gartner client inquiries in a three-year period) suggests that fewer and fewer of these live tests involve all production applications and data. Instead, tests are specific to an individual recovery tier (typically, the recovery tier corresponding to the most mission-critical applications) or include an affinity group of production applications that have related software and data dependencies. This means that many organizations follow the 80/20 rule: 80% of the testing is done on the applications that are the most mission-critical (which are often 20% or less of the total number of production applications).

Despite this data, however, you shouldn't completely ignore test procedures for less critical applications and data. Rather, IT must ensure the recovery of the business processes and supporting applications, the loss of which would cause the greatest loss of revenue, productivity or organizational reputation. In terms of how often an organization should conduct testing, we offer the following baselines, again subject to your organization's special circumstances:

  • Conduct live testing for Tier 1 and Tier 2 applications and data at least twice per year.
  • Initiate more frequent (monthly, quarterly) manual or (ideally) automated testing on application affinity groups.
  • Perform failover and failback testing during the same or separate planned downtime periods.
  • Ensure that the required data restoration and application activation cycle times meet or beat the RTO and RPO targets.

Regardless of how you determine recovery tier definitions, it is important to begin thinking about how you can best test recoverability, especially for the most mission-critical application data. Test more frequently the related applications and data that support a smaller set of key business processes, and shift the testing focus to how IT can best meet or beat the associated recovery targets.

Pain Point Remediation Alternatives

Automated Dependency Mapping

The challenge of ensuring that all required software and data dependencies are addressed in a recovery configuration will become more complex, as new business applications that have been purchased, created by in-house development teams, or acquired through
merger and acquisition (M&A) activity are turned over to production. Increasingly mature IT service dependency mapping tools can help. These products, available from vendors such as BMC Software, CA Technologies, HP and IBM, enable IT organizations to discover, document and track relationships by mapping dependencies among the infrastructure components, such as servers, networks, storage and applications, that form an IT service (see "IT Service Dependency Mapping Tools: Market Dynamics Update"). These tools are used primarily for applications, servers and databases; however, a few discover network devices (such as switches and routers), mainframe-unique attributes and virtual infrastructures, thereby presenting a complete service map. Although these tools are often bought in conjunction with configuration management database (CMDB) projects, we have seen a significant increase in their acquisition and use for data center-specific projects, such as IT-DRM modernization and data center consolidation.

Data dependency mapping products from 21st Century Software, AppAssure, Bocada, Continuity Software, InMage and Sanovi provide automated data, metadata and index consistency assurance between production files and databases and their replicas that are maintained at one or more recovery sites. Background software agents determine and report on the likelihood of achieving specified recovery targets, based on analyzing and correlating data from applications, databases, clusters, OSs, virtual systems, networking and storage replication mechanisms. These products perform their consistency checking on data located on direct-attached storage (DAS), storage-area-network (SAN)-connected storage or network-attached storage (NAS) at the primary production and secondary recovery data centers.
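
The kind of check a service dependency map makes possible can be sketched in a few lines of Python. This is an illustration only, not how any of the vendor products named above work: the component names, the tier assignments and the helper function are all hypothetical, and the tier numbers follow Table 1 (1 = most critical). The sketch walks the dependency map and flags every component that a given application relies on, directly or indirectly, but that sits in a lower-priority recovery tier.

```python
from collections import deque

# Hypothetical inventory: recovery tier per component (1 = most critical).
TIERS = {
    "web-store": 1,
    "payments": 1,
    "inventory-db": 2,
    "reporting": 3,
    "mail-relay": 4,
}

# Hypothetical dependency map: component -> components it relies on.
DEPENDS_ON = {
    "web-store": ["payments", "inventory-db"],
    "payments": ["mail-relay"],
    "inventory-db": ["reporting"],
    "reporting": [],
    "mail-relay": [],
}


def lower_tier_dependencies(app, tiers, depends_on):
    """Return the transitive dependencies of `app` that belong to a
    lower-priority tier (a higher tier number) than `app` itself."""
    seen = {app}
    queue = deque([app])
    flagged = []
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep in seen:
                continue
            seen.add(dep)
            queue.append(dep)
            if tiers[dep] > tiers[app]:
                flagged.append(dep)
    return sorted(flagged)


if __name__ == "__main__":
    for name in ("web-store", "payments"):
        print(name, "->", lower_tier_dependencies(name, TIERS, DEPENDS_ON))
```

Each flagged component is a candidate either for promotion to the tier of the application that needs it or for inclusion in that application's affinity group when tests are planned.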

Synchronizing Distributed Change

Ensuring 100% change consistency between the production data center configuration, applications and data and their recovery data center counterparts is a challenging task. At a minimum, the recovery infrastructure at the secondary site must be dedicated, although this may not be the case for the recovery facility itself. Typically, asynchronous data replication (either host- or storage controller-based) and server virtualization are used to support a partial or full development and testing configuration that is used by in-house application development, support and testing teams during normal production hours. In this scenario, synchronizing changes between the primary production configuration and the development and test configuration (which may also support recovery) is typically managed by the development and testing teams, in conjunction with operations support. This may involve the automated replication of updated production virtual server images to the secondary configuration, in parallel or in tandem with production data replication.

Several product options support virtual server replication, including offerings from such vendors as Acronis, Asigra, Atempo, BakBone Software, CA Technologies, CommVault, Double-Take Software, EMC, FalconStor Software, HP, i365, IBM, InMage, Microsoft, NetApp, Novell, PHD Virtual, Quest Software, Symantec, Syncsort and Veeam. However, for recovery configurations that include a mix of physical and virtual servers, as well as a combination of shrink-wrapped and in-house-developed applications, the use of IT process automation tools that orchestrate infrastructure configuration, provisioning and change updating is likely to be required.
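
A minimal Python sketch of the consistency check that such synchronization aims to keep clean is shown here. It assumes, purely for illustration, that each site can export a simple inventory of component names and versions. The inventories, the component names and the `config_drift` function are hypothetical and do not represent the interface of any product listed above.

```python
def config_drift(production, recovery):
    """Compare two {component: version} inventories and report the
    differences that a recovery test would otherwise find the hard way."""
    prod_names = set(production)
    rec_names = set(recovery)
    return {
        # Present in production but absent at the recovery site.
        "missing": sorted(prod_names - rec_names),
        # Present at the recovery site only (possibly stale leftovers).
        "unexpected": sorted(rec_names - prod_names),
        # Present at both sites with different versions.
        "mismatched": {
            name: (production[name], recovery[name])
            for name in sorted(prod_names & rec_names)
            if production[name] != recovery[name]
        },
    }


# Hypothetical inventories for illustration only.
PRODUCTION = {"os-kernel": "5.4", "app-server": "9.2", "db-engine": "11.1", "agent": "3.0"}
RECOVERY = {"os-kernel": "5.4", "app-server": "9.0", "db-engine": "11.1", "old-batch": "1.7"}

if __name__ == "__main__":
    drift = config_drift(PRODUCTION, RECOVERY)
    for kind, items in drift.items():
        print(kind, items)
```

Running a comparison of this kind after each change window, not only before a scheduled exercise, keeps the gap between the two configurations from growing unnoticed.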

(Further information on the current state of IT process automation, change and configuration management can be found in "Hype Cycle for IT Operations Management, 2011.")

Consolidating Testing Personnel, Tools and Skill Sets

One approach that has met with some client success is consolidating what were previously separate QA and recovery testing teams into a single organization. Organizational
consolidation, together with the consolidation and standardization of testing platforms and scripts, is an approach that can be used to support preproduction turnover regression testing as well as ongoing DR testing.

Organizations that implemented this approach did so to address a lack of recovery testing breadth and depth. Given the increasing numbers of mission-critical applications requiring recovery, as well as the related numbers of inquiries and transactions, it became clear that manual or semimanual testing processes could only provide limited recovery assurance. This was because the extent to which a full set of production inquiries and transactions could be consistently exercised by the recovery exercising team was limited by testing time constraints.

In one specific instance, a recovery team was able to meet the required RTO and RPO targets for the most mission-critical applications, but the recovery of the production environment, as perceived by the business unit end users, was short-lived, because undiscovered (and, therefore, unaddressed) software and data dependencies resulted in several inquiries and transactions prematurely aborting or incurring unacceptably long response times. The net result was that the recovery team won the battle by supporting the required RTOs and RPOs, but lost the war, because the usability and effectiveness of the recovery operations configuration was limited. A new approach was needed that could not only improve the breadth and depth of application testing coverage, but could also increase the efficiency and effectiveness of recovery exercising as a whole.
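
The gap described in this instance, between meeting recovery targets and delivering a usable recovery environment, can be made explicit when an exercise is scored. The Python sketch below is one hypothetical way to do that and is not taken from the research note. The class names, the thresholds (a five-second response limit and a 95% pass rate) and the sample transactions are assumptions chosen for illustration. The RTO and RPO values are the upper bounds of Tier 1 in Table 1.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    name: str
    completed: bool          # False if the transaction aborted
    response_seconds: float  # measured against the recovery configuration


@dataclass
class ExerciseResult:
    recovery_hours: float    # time taken to restore service
    data_loss_hours: float   # age of the most recent recovered data
    transactions: list = field(default_factory=list)


def evaluate(result, rto_hours, rpo_hours, max_response_seconds, min_pass_rate):
    """Score an exercise on recovery targets and on transaction usability."""
    targets_met = (
        result.recovery_hours <= rto_hours
        and result.data_loss_hours <= rpo_hours
    )
    passed = [
        t for t in result.transactions
        if t.completed and t.response_seconds <= max_response_seconds
    ]
    total = len(result.transactions)
    pass_rate = len(passed) / total if total else 0.0
    return {
        "targets_met": targets_met,
        "pass_rate": pass_rate,
        "usable": targets_met and pass_rate >= min_pass_rate,
    }


# Hypothetical exercise: targets are met, but half the transactions fail.
EXERCISE = ExerciseResult(
    recovery_hours=6.0,
    data_loss_hours=2.0,
    transactions=[
        Transaction("login", True, 0.8),
        Transaction("order-lookup", True, 1.2),
        Transaction("place-order", False, 0.0),   # aborted
        Transaction("invoice-query", True, 45.0),  # unacceptably slow
    ],
)

if __name__ == "__main__":
    print(evaluate(EXERCISE, rto_hours=8, rpo_hours=4,
                   max_response_seconds=5.0, min_pass_rate=0.95))
```

In this sample the exercise meets its RTO and RPO but is still reported as not usable, which is the outcome the recovery team in the example discovered only after the fact.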

Following an assessment of the technical benefits and cost savings that could result from a merger of the internal QA and the DR testing teams, a decision was made to consolidate them into a single organization and to standardize the management and automation of test processes by leveraging many of the tools, scripts and staff resources that were already in place. The benefits that have been realized by some of the early adopters of this approach include increasingly reliable and more effective test exercises, combined with more thorough testing of representative production inquiries and transactions against the recovery configuration. The latter improves the likelihood that recovery operations can be initiated within required RTO and RPO targets, and ensures more stable recovery operations.

Summary

IT-DRM managers may recognize one or more of these approaches as potentially adding value to their IT-DRM programs. Regardless of which side of the issue you see your organization leaning toward, it is important to consider the key technologies your organization uses, because, for many organizations, the use of more traditional recovery testing and technology that helps manage more sustained availability may not be so much a case of "either/or" in the next five years, but rather a case of "and."

Recommended Reading

Some documents may not be available as part of your current Gartner subscription.

  • "Hype Cycle for Business Continuity Management and IT Disaster Recovery Management, 2011"
  • "From Development to Production: Integrating Change, Configuration and Release"
  • "Predicts 2011: Improved Recoverability May Be on the Horizon, but Significant Challenges Remain"
  • "Data Center Conference Poll Findings: Disaster Recovery Testing Mistakes"
  • "Cost-Cutting IT: Should You Cut Back Your Disaster Recovery Exercise Spending?"
  • "Toolkit: Best Practices for a Successful Tabletop Recovery Test"
  • "Hype Cycle for IT Operations Management, 2011"
  • "IT Service Dependency Mapping Tools: Market Dynamics Update"

Strategic Planning Assumption

By the end of 2014, 15% of enterprises will have significantly reduced or eliminated traditional DR testing as a result of supporting more resilient IT operations.

© 2011 Gartner, Inc. and/or its Affiliates. All Rights Reserved. Reproduction and distribution of this publication in any form without prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information. Although Gartner's research may discuss legal issues related to the information technology business, Gartner does not provide legal advice or services and its research should not be construed or used as such. Gartner shall have no liability for errors, omissions or inadequacies in the information contained herein or for interpretations thereof. The opinions expressed herein are subject to change without notice.