Transcript

  • 1. An Information Technology Wake-up Call: Disaster Recovery Planning Impact to Capital Markets Technology & Data Center Critical Infrastructure. TECHNICAL WHITE PAPER. Assess and Mitigate Risk and Vulnerabilities to Business Continuity and Disaster Recovery in the New York Metropolitan Area. Vincent Pelly, Scott Haglund; Sophie Pascal, Contributing Editor
  • 2. Table of Contents
      Executive Summary ... 1
      What happens to Business when the lights go out? ... 1
      Intended Audience and Structure ... 2
      Keeping Business in Business ... 3
      The unanticipated hidden risks ... 3
      Lessons Learned ... 4
      Review of past events are key to effective Disaster Recovery ... 4
      Risks to the Critical Infrastructure ... 6
      Climate Conditions and Patterns ... 6
      Seismic Activity and Risk ... 7
      Electrical Distribution and the Power Grid ... 8
      Data Center Reliability Classification ... 10
      Best Practices ... 11
      For Infrastructure Design ... 11
      High Availability & Disaster Recovery ... 11
      RTO and RPO ... 11
      Expectations for Continuous Availability ... 12
      Virtualization ... 13
      Replication and Network Bandwidth ... 14
      Database Replication ... 14
      Types of Backup Recovery and Replication Architectures ... 15
      Disaster Recovery Site Selection ... 16
      Summary and Recommendations ... 17
      Best Practices ... 18
      Business Continuity Management Framework ... 19
      Elements of Business Recovery Planning ... 20
      FEMA Flood Maps ... 21
      Appendix A ... 22
      FEMA Flood Hazard Mapping - HIGH ... 22
      FEMA Flood Hazard Mapping - HIGH (cont’d) ... 23
      FEMA Flood Hazard Mapping - LOW ... 24
  • 3. FEMA Flood Hazard Mapping - LOW (cont’d) ... 25
      FEMA Flood Hazard Mapping - LOW (cont’d) ... 26
      Appendix B ... 27
      Natural Disaster Risk Profiles for Data Centers ... 27
      Natural Disaster Risk Profiles for Data Centers (cont’d) ... 28
      Appendix C ... 29
      East Coast Liquidity Venues ... 29
      Works Cited & References ... 30
      About Citihub ... 31
      About the Authors ... 31
  • 4. 1 Executive Summary What happens to Business when the lights go out? In the aftermath of Hurricane Sandy, significant flooding in coastal areas left a majority of the Northeastern United States without commercial electricity. Many businesses lost power because their buildings were located in zones that were flooded with seawater and because the main electrical panels were located below the rising water level. Backup generators that supported data centers could not be supplied with fuel because the fuel pumps were located in flooded basements. Firms that had not pre-purchased fuel or secured delivery contracts for their backup generators were unable to operate their data centers beyond fuel storage capacity, and firms that did pre-purchase fuel could not receive deliveries due to flooded roadways. Employees were unable to access their offices, critical staff members were unable to travel to offsite recovery locations because government mandates forbade access to roadways for non-essential personnel, and customers were unable to complete online transactions. The overall impact of Hurricane Sandy was estimated at between $30 billion and $50 billion.[1] The numerous failures of mission-critical IT infrastructure drew immediate attention to some very important design flaws in today's Recovery plans and processes. These flaws show that data center facilities are vulnerable, leaving Business exposed to outages it cannot afford. The objective of this white paper is to provide senior executives with an overview of Disaster Recovery preparedness as well as the potential risks and vulnerabilities that exist in critical infrastructure, specifically in the New York metropolitan area. It will also help senior executives to become aware of critical details that may not be covered in their current Disaster Recovery plans.
      We at Citihub believe in the importance of having an end-to-end Business Continuity solution that includes not only a tested and validated data center and infrastructure design, but also the ability to provide staff with remote access to the key applications needed to continue operations. The recommendations in this white paper outline high-level frameworks for addressing business systems redundancy. They also show how to significantly reduce data loss by using design principles and best practices to build the Disaster Recovery system that best supports Business requirements. Although the target industry is financial services, this paper can serve as a primary reference for building the appropriate Disaster Recovery solution for any company, regardless of industry or geography. Finally, this paper offers a long-term business case for addressing critical vulnerabilities, as well as factors that senior executives should take into consideration when setting priorities regarding critical infrastructure. Addressing these helps to ensure Business Continuity and to prevent loss of revenue in the event of another major outage.
      [1] http://online.wsj.com/article/SB10001424052970204712904578092663774022062.html?mod=googlenews_wsj
  • 5. 2 Intended Audience and Structure This white paper is intended to help senior management and senior-level executives of financial services institutions navigate the Business Continuity and Disaster Recovery landscape. It outlines successful implementation strategies and best practices, and assumes that readers have basic knowledge of networks and infrastructure, as well as awareness of the geographical specificity of their businesses. Citihub will examine how site selection, power, cooling, and inadequacies within the system recovery architecture can contribute to a data center's risk of downtime. The analysis will explore specific data center infrastructure vulnerabilities and recommend best practices that identify and remediate gaps within the infrastructure to minimize downtime and achieve the highest possible return on investment.
  • 6. 3 Keeping Business in Business The unanticipated hidden risks The technological ecosystem supporting financial markets relies heavily on centralized data centers, infrastructure and communication networks as the core processing engines of capital markets. Uninterrupted operation is critical to the daily business of the financial services industry, serving e-commerce, market data and pricing, matching engines, settlements and other critical systems, transactions and data that enable sell-side and buy-side firms to maintain worldwide market liquidity. Firms are at risk when disruption to the IT infrastructure occurs: systems are down and information is unavailable, adversely impacting business operations. Financial market participants, including retail banks and institutional securities firms, require reliable and consistent operations to support front and back office systems, particularly settlement and clearing firms that process open transactions and communications with customers, counterparties and third parties. Disruptions to daily operations can prevent financial institutions from managing liquidity, which can increase financial risk to their organizations. These are some of the business and technical drivers behind the design and implementation of robust Disaster Recovery plans, and they should be given priority when selecting proper backup sites and developing sound Recovery management processes.
      Examples of system outages that should be considered when designing business and system resiliency plans:
      • Isolated failures caused by software or hardware errors, or by recent system upgrades that were not fully tested
      • External outages to telecommunications and electrical feeds caused by inadvertent damage to primary lines
      • Loss of critical infrastructure and mechanical and electrical systems, as well as failure of backup systems to provide continuous operations
      • Widespread outages caused by natural disasters and catastrophic events
      Immediate threats and consequences of not having a Disaster Recovery plan:
      • Loss of revenue and of customer confidence, and damage to the corporate brand and reputation, can arise from the inability of clients to access systems and account information or execute transactions
      • Cost to restore operations to a normal state; without proper planning and Disaster Recovery management, this can be expensive
      • Potential fines or fees can be imposed for non-compliance related to unprepared resiliency plans resulting from extended outages[2]
      [2] Dodd-Frank H.R. 4173 – 316 “(ii)establish and maintain emergency procedures, backup facilities, and a plan for disaster recovery”
  • 7. 4 Lessons Learned
      Review of past events is key to effective Disaster Recovery
      Business today has not fully internalized the significant findings of an interagency white paper dated almost ten years ago. During the past 12 years the East Coast of the United States, in particular the Northeast and the New York metropolitan area, has experienced several widespread power outages related to extreme weather conditions that have greatly impacted technology infrastructure. These events confirm that our critical IT infrastructure is vulnerable to regional disruption (power outages, climate change and natural disasters), as demonstrated by the increase in wide-scale and regional disruption over the past decade. In response to these events, IT executives have revised Business Continuity plans and introduced alternative backup sites, such as tertiary sites in geographical regions outside the location of the primary corporate site.
      Within the financial services community, senior industry leaders along with the Federal Reserve Board, OCC and SEC issued in 2005 an interagency white paper[3] that described best practices to strengthen the resiliency of U.S. financial services post 9/11. The paper stressed the critical importance of protecting the financial system from new risks associated with widespread outages by focusing on the following high-level Business Continuity objectives:
      • Rapid recovery and timely resumption of critical operations
      • Key staff to resume critical operations in one major operating location
      • Comprehensive testing that demonstrates effective internal and external continuity arrangements
      “Firms that play significant roles in critical financial markets should maintain sufficient geographically dispersed resources, including staff, equipment and data to recover clearing and settlement activities within the business day on which a disruption occurs. Firms may consider the costs and benefits of a variety of approaches that ensure rapid recovery from a wide-scale disruption. However, if a backup site relies largely on staff from the primary site, it is critical for the firm to determine how staffing needs at the backup site would be met if a disruption results in loss or inaccessibility of staff at the primary site.” - Federal Reserve White Paper on the Resiliency of the U.S. Financial System, 2005
      [3] Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System, September 2005, www.sec.gov/rules/concept/34-46432.htm
  • 8. 5 The results of the Federal interagency white paper, as well as the analyses and discussions held with financial industry technology experts and practitioners, show that sound practices based on the above key points have resulted in the development and implementation of best practices regarding Business Continuity. It is understood industry-wide that many firms at the time did not embrace the urgency of the report, mostly because of cost considerations. Today its findings can no longer be ignored. On the strength of that interagency paper, and in reviewing past and recent events, it is imperative that these key points be taken into consideration when designing and building the Disaster Recovery architecture:
      • Perform a top-down assessment of critical business activities, mapped to supporting IT systems and key staff members
      • Prioritise the systems to recover first and assign the required support staff for a potentially limited capability in recovery mode
      • Establish a crisis management team that will coordinate activities and make prioritisation calls on the ground. Critical time is often lost in the decision-making process to invoke a Disaster Recovery plan.
      • Maintain a solid Recovery plan built around established backup site(s) for data centers and all key business staff, separate from the core processing location
      • Periodically test backup systems and network connectivity, and perform application role swaps on a scheduled basis to ensure Recovery plans function properly
      Comprehensive Disaster Recovery testing should be end-to-end and involve telecommunication firms, third-party service providers and securities exchanges, as well as vetting of the business process and the proper activation sequence for application systems. It should also serve to familiarize business users with operational procedures in unusual situations.
  • 9. 6 Risks to the Critical Infrastructure
      Climate Conditions and Patterns
      “NOAA estimated approximately $1 billion in damage that occurred in 2011 from 12-14 major events”[4] - NOAA 2012
      A significant concern when reviewing an organization’s primary and recovery sites is their geographic vulnerability to severe weather. Tools and resources available from FEMA and the National Oceanic and Atmospheric Administration[5], together with historical weather patterns, can provide data on locations that have had consistent damage due to severe weather. Below is a summary of the NOAA 2011 and 2012 National Events Map for the U.S.
      [Figure: Significant U.S. Weather and Climate Events, 2011 and 2012]
      The Uptime Institute Natural Disaster Risk Profiles[6], summarized in Appendix B, outline the risks to data center sites geographically associated with severe weather. A data center in or near the storm path should expect disruption, as well as minor to severe infrastructure damage, when subject to the following natural disasters:
      • Tornado
      • Hurricane
      • Earthquake
      • Ice Storm
      • Blizzard
      • Thunderstorm
      • Lightning
      • Flood
      For detailed FEMA flood maps of the New York Metro area, please refer to Appendix B. The primary locations of critical data centers serving financial services are listed in Appendix A.
      [4] http://www.noaa.gov/extreme2011/index.html
      [5] NOAA National Climatic Data Center, State of the Climate: National Overview for Annual 2012, published online December 2012 from http://www.ncdc.noaa.gov/sotc/national/2012/13.
      [6] Uptime Institute, Natural Disaster Risk Profiles for Data Centers, http://uptimeinstitute.com/publications
  • 10. 7 Seismic Activity and Risk
      Historically, earthquakes and seismic activity are rare in the New York metropolitan area, with the exception of the 2011 Virginia earthquake[7], which produced tremors throughout the New York area. Although no damage or outages occurred during the 2011 event, it is a best practice to evaluate seismic activity when selecting a primary and recovery site. The following maps summarize the historical impact, magnitude and spread of seismic activity for the U.S. and the New York area.[8]
      [Figure: U.S. & New York Area Seismic Hazard Map. Source: USGS]
      [7] http://en.wikipedia.org/wiki/2011_Virginia_earthquake
      [8] United States Geological Survey, http://earthquake.usgs.gov/earthquakes/states/new_york/hazards.php
  • 11. 8 Electrical Distribution and the Power Grid
      When planning for alternative Disaster Recovery backup locations, as well as when performing a risk and vulnerability assessment of the primary site, another key area of concern is the location of the power utilities and the major interconnections of the power grid. This type of assessment becomes critical when planning for 2N[9] redundancy for primary and secondary locations. To lower the risk of localized power outages, full disclosure of the locations of power stations, substations and feeds to the facility, as well as of the redundancy within the feeds, is necessary to determine where electrical power gaps may exist.
      [Figure: U.S. Electrical Grid and Power Plants. Source: NPR]
      “The U.S. electric grid is a complex network of independently owned and operated power plants and transmission lines.” - NPR, Visualizing The U.S. Electric Grid
      [9] When referring to the data center utility feed, a 2N system contains double the capacity needed, with the two sets of components running separately so that there is no single point of failure.
  • 12. 9 Data Center Components The critical items within a data center include a number of systems that control and run the electrical and mechanical components necessary for successful operation. Many of these systems are tied into the Building Management Systems (BMS), and others are directly linked to IT monitoring systems. Within the past two years, the industry has taken the stance that both BMS and IT critical systems should be managed and monitored by a single system and reported via a holistic dashboard. These systems are part of the building envelope, and each contains a set of core delivery mechanisms and risk profiles. During Hurricane Sandy, many of the critical systems, specifically the mechanical and electrical (M&E) systems, were severely damaged because storm surge water entered the basements and took down main electrical panels, water and fuel pumps, and other equipment. Many data centers with generator fuel pumps located in basements had difficulty starting up backup generators, and in some cases fuel had to be delivered manually to generators located on higher floors (by bucket brigade). In addition to the M&E systems, other common infrastructure dependencies required to maintain operations during a recovery period generally relate to the telecommunications infrastructure. During a widespread outage it is critical that the telecommunications infrastructure remain intact across the United States. Firms can mitigate this risk by implementing resiliency through circuit diversity and diverse routing when establishing geographically dispersed facilities. Source: Citihub
  • 13. 10 Data Center Reliability Classification
      Several data center industry experts have defined reliability classifications for the data center infrastructure. The term reliability covers a variety of attributes of how the data center has been engineered, including availability, durability and quality. The following five performance-based classes have been defined to classify the reliability of the data center, based on the Building Industry Consulting Service International (BICSI) standard for IT systems[10].
      Class F0: Single Path without Alternate Power Source
      • Supports the basic environmental and energy requirements of the IT functions without supplementary equipment
      • Capital cost avoidance is the major driver
      • There is a high risk of downtime due to planned and unplanned events
      • Maintenance can be performed during non-scheduled hours, and downtime of several hours or even days has minimum impact on the mission
      • A critical power distribution system separate from the general use power systems would not exist
      • No back-up generator system
      • The system might deploy power conditioning or surge protective devices to allow the specific equipment to function adequately (utility grade power does not meet the basic requirements of critical equipment)
      • No redundancy for power or air conditioning
      Class F1: Single Path
      • Supports the basic environmental and energy requirements of the IT functions
      • There is a high risk of downtime due to planned and unplanned events
      • Maintenance can be performed during non-scheduled hours, and the impact of downtime is relatively low
      • The critical power distribution system would deploy a power conditioning device to allow the critical equipment to function adequately (utility grade power does not meet the basic requirements of critical equipment)
      • No redundancy of any kind would be used for power or air conditioning for a similar reason
      Class F2: Single Path with Redundant Components
      • Provides a level of reliability higher than Class F1 to reduce the risk of downtime due to component failure
      • There is a moderate risk of downtime due to planned and unplanned events
      • Maintenance activities can typically be performed during unscheduled hours
      • The critical power system would need redundancy in those parts of the electrical distribution system that are most likely to fail
      • These would include any products that have a high parts count or moving parts, such as UPS, controls, air conditioning, generators or ATS
      • In addition, it may be appropriate to specify premium quality devices that provide longer life or better reliability
      Class F3: Concurrently Maintainable
      • Provides additional reliability and maintainability to reduce the risk of downtime due to natural disasters, human-driven disasters, planned maintenance, and repair activities
      • Maintenance and repair activities will typically need to be performed during full production time with no opportunity for curtailed operations
      • The critical power system must provide reliable, continuous power even when major components (or, where necessary, major subsystems) are out of service for repair or maintenance
      • To protect against unplanned downtime, the power system must be able to sustain operations while a dependent component or subsystem is out of service
      Class F4: Fault Tolerant
      • Eliminates downtime through the application of all tactics to provide continuous operation regardless of planned or unplanned activities
      • All recognizable single points of failure, from the point of connection to the utility to the point of connection to the critical loads, are eliminated
      • Systems are typically automated to reduce the chances for human error and are staffed 24×7
      • Rigorous training is provided for the staff to handle any contingency
      • Compartmentalization and fault tolerance are prime requirements
      • The critical power system must provide reliable, continuous power even when major components (or, where necessary, major subsystems) are out of service for repair or maintenance
      • To protect against unplanned downtime, the power system must be able to sustain operations while a dependent component or subsystem is out of service
      [10] BICSI Standards for Data Centers, https://www.bicsi.org/default.aspx
  • 14. 11 Best Practices For Infrastructure Design
      High Availability & Disaster Recovery
      High Availability and Disaster Recovery are both concepts related to Business Continuity. But whereas Business Continuity applies to the whole business (including IT), HA & DR typically are more related to IT Continuity, as part of overall Business Continuity. High Availability solutions mainly address outages at a single site, while Disaster Recovery solutions mainly address sudden, site-wide disasters. High Availability and Disaster Recovery objectives and metrics are different. A highly available site provides resiliency from errors of the underlying platform and single points of failure. Availability encompasses reliability, recovery, and failure. One of the most common measures of availability is the percentage of time that a given system is active and working. The following table correlates the percentage of availability to calendar time equivalents.
      Acceptable uptime   Downtime per day   Downtime per month   Downtime per year
      99%                 14.40 minutes      7 hours              3.65 days
      99.9%               86.40 seconds      43 minutes           8.77 hours
      99.99%              8.64 seconds       4 minutes            52.60 minutes
      99.999%             0.86 seconds       26 seconds           5.26 minutes
      RTO and RPO
      RTO is the elapsed time from service interruption until service is restored. It answers the question: "How long can you be without service?" RTO represents a time limit that cannot be exceeded without facing severe consequences. A unified High Availability and Disaster Recovery approach would establish both an uptime objective and an RTO for each service. RPO, on the other hand, is the point of time represented by the data upon service resumption. It answers the question: "How old can the data be?"
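      The availability table above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name and the period lengths (a 30-day month and a 365.25-day year, which appear to match the table's rounding) are assumptions, not something the paper states.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumed period lengths in days; chosen because they appear to match the table.
PERIOD_DAYS = {"day": 1, "month": 30, "year": 365.25}


def downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Return the downtime, in seconds, implied by an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Print the implied downtime, in minutes, for each availability level.
    for pct in (99, 99.9, 99.99, 99.999):
        row = ", ".join(
            f"per {name}: {downtime_seconds(pct, days) / 60:.2f} min"
            for name, days in PERIOD_DAYS.items()
        )
        print(f"{pct}% uptime -> {row}")
```

      The same function can be used the other way round when setting targets: pick the downtime the business can tolerate per year and compare it with the figure each availability level implies.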
  • 15. 12 Expectations for Continuous Availability
      Data Replication
      The two basic methods of data replication are synchronous and asynchronous. In general terms, synchronous capabilities are used for shorter distances, and asynchronous capabilities are used for longer distances. The method chosen depends on Business Recovery requirements.
      Synchronous replication ensures that a remote copy of the data, identical to the primary copy, is created at the time the primary copy is updated. In synchronous replication, an update operation is not considered done until completion is confirmed at both the primary and secondary site. An incomplete operation is rolled back at both locations, ensuring that the remote copy is always an exact mirror image of the primary.
      Asynchronous replication places data updates in a queue on the primary server, but it does not wait for update acknowledgments from the secondary server. As a result, any data not yet copied across the network to the secondary server is lost if the primary server fails, so application data may be lost in this type of failure.
      Most companies cannot tolerate more than a few hours or even minutes of downtime without serious impact to the bottom line. Synchronous data replication may be the appropriate solution for companies seeking the fastest possible data recovery, minimal data loss, and protection against database integrity problems.
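      The difference between the two replication methods can be illustrated with a toy in-memory model. This is a minimal sketch under stated assumptions: the class names (`Replica`, `SyncPrimary`, `AsyncPrimary`) are hypothetical, there is no real network or storage involved, and real replication products handle ordering, retries and rollback in far more detail.

```python
from collections import deque


class Replica:
    """Hypothetical secondary site holding a copy of the data."""

    def __init__(self):
        self.data = {}
        self.available = True

    def apply(self, key, value):
        if not self.available:
            raise ConnectionError("replica unreachable")
        self.data[key] = value


class SyncPrimary:
    """A write completes only once the replica has confirmed it."""

    def __init__(self, replica):
        self.data = {}
        self.replica = replica

    def write(self, key, value):
        # If the replica cannot confirm, the exception propagates and the
        # primary copy is left unchanged, so both copies stay identical.
        self.replica.apply(key, value)
        self.data[key] = value


class AsyncPrimary:
    """A write completes locally; updates are queued and shipped later."""

    def __init__(self, replica):
        self.data = {}
        self.replica = replica
        self.queue = deque()

    def write(self, key, value):
        self.data[key] = value
        self.queue.append((key, value))

    def ship(self, max_updates):
        """Send up to max_updates queued updates to the replica."""
        for _ in range(min(max_updates, len(self.queue))):
            key, value = self.queue.popleft()
            self.replica.apply(key, value)

    def pending(self):
        """Updates that would be lost if the primary failed right now."""
        return list(self.queue)
```

      In this model the length of `pending()` at the moment of failure is the data loss, which is what the RPO is meant to bound; the synchronous variant keeps that at zero at the cost of waiting for the remote site on every write.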
  • 16. 13 Virtualization Virtualization makes it possible to implement Disaster Recovery plans at a significantly lower cost. Since virtual machines are hardware-independent, any physical server can be used as a recovery target for any virtual machine. As virtualization also makes it possible to consolidate workloads onto fewer servers, organizations can significantly reduce the cost of hardware for Disaster Recovery by reducing the number of servers needed at the recovery site. Many organizations have already embraced the benefits of virtualization, as it can add tremendous value to Disaster Recovery planning. Before virtualization, Disaster Recovery was often too expensive to implement, and many organizations chose to protect only the most critical applications. Consolidating multiple physical servers as virtual machines on shared hosts significantly reduces the number of physical servers that need to be recovered in the event of an outage.
  • 17. 14 Replication and Network Bandwidth
      Network bandwidth can also introduce challenges to data replication strategies. It’s important to understand the amount of changed data that can occur within a given period of time. This period of time is referred to as the replication latency window. Depending on the rate of changed data in a given system, one can determine the amount of bandwidth needed. The network bandwidth guideline below can assist with these calculations.
      Estimated Hours To Replicate Capacity
      Network          20 GB    80 GB    120 GB   200 GB   300 GB   730 GB
      T1               42.33    169.31   253.97   423.28   634.92   1544.97
      10Base-T LAN     6.50     26.01    39.01    65.02    97.52    237.31
      DS3 / T3         1.50     6.02     9.03     15.05    22.57    54.93
      100Base-T LAN    0.65     2.60     3.90     6.50     9.75     23.73
      OC3              0.42     1.68     2.52     4.19     6.29     15.31
      OC12             0.10     0.42     0.63     1.05     1.57     3.82
      Database Replication
      Database replication is similar to database mirroring. These solutions use production database transaction logs to maintain a current copy of the production database on a standby server. In the event of a server outage, the database replication software automatically promotes the standby database to become the production database. There are traditionally no restrictions on where the databases can reside, provided that they can communicate with each other. Synchronous replication, however, does have some drawbacks. It has a theoretical distance limitation of 200 kilometres (km) or 124 miles, but the practical distance limitation for a busy system could be as little as 50 km or about 30 miles.
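      The estimated-hours table can be approximated with a simple transfer-time calculation. The sketch below is a rough illustration, not the paper's method, which is not stated: the function names are hypothetical, and the 70% effective-throughput factor and the 1 GB = 1024 MB convention are assumptions inferred from the 10Base-T and 100Base-T rows, which they reproduce closely. Other rows may rest on slightly different nominal line rates.

```python
def hours_to_replicate(size_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate the hours needed to copy size_gb over a link of link_mbps.

    efficiency is the assumed fraction of nominal bandwidth that is usable
    for replication traffic (an assumption, not a figure from the paper).
    """
    if size_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, bandwidth > 0, efficiency in (0, 1]")
    megabits = size_gb * 1024 * 8  # GB -> MB -> megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


def required_mbps(changed_gb: float, window_hours: float, efficiency: float = 0.7) -> float:
    """Nominal bandwidth needed to ship changed_gb within the replication window."""
    if changed_gb < 0 or window_hours <= 0 or not 0 < efficiency <= 1:
        raise ValueError("changed data must be >= 0, window > 0, efficiency in (0, 1]")
    megabits = changed_gb * 1024 * 8
    return megabits / (window_hours * 3600 * efficiency)


if __name__ == "__main__":
    # Nominal line rates in Mbps for the two LAN rows of the table.
    for name, mbps in (("10Base-T LAN", 10), ("100Base-T LAN", 100)):
        hours = [hours_to_replicate(gb, mbps) for gb in (20, 80, 120, 200, 300, 730)]
        print(name, " ".join(f"{h:.2f}" for h in hours))
```

      `required_mbps` turns the same relationship around: given the amount of data that changes within the replication latency window, it returns the nominal link speed needed to keep up under the same efficiency assumption.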
  • 18. 15 Types of Backup Recovery and Replication Architectures
Choosing the best-suited backup and recovery option for an organization can be challenging. Traditionally, businesses request little to no downtime when recovering from a disaster or other type of outage. Implementing these types of solutions may represent a sizable investment. Management will have to decide which recovery option best fits the organization's needs, particularly in relation to risk assessment, compliance and other requirements, as outlined earlier in this paper.

Single Site Backup and Recovery
    • Backups and snapshots required for off-site storage must be created periodically
    • Data can only be as up-to-date as the last backup, whether daily, weekly or monthly
    • Recovery is limited to the point in time of the last backup

Multi-Site Asynchronous Data Replication
    • Asynchronous replication is supported by disk arrays, networks and host-based replication products
    • Changes to data are committed to the source first, then buffered or journaled and sent to the replication target(s)
    • It is designed to work over long distances and greatly reduces bandwidth requirements
    • This can introduce delays ranging from nearly instantaneous to several hours, dependent on network latency
    • There is also no guarantee that the secondary system will have the most recent copy of the data if the primary fails

Multi-Site Synchronous Data Replication
    • Used primarily for high-end transactional applications that require instantaneous failover if the primary node fails
    • With synchronous replication, data is written to the primary and secondary storage systems at the same time, and a write is not complete until it is acknowledged by both the local and remote storage systems
    • Synchronous replication requires considerable bandwidth, which also makes it more expensive

Cloud Backup and Recovery
    • Applications and data remain on-premises in this approach, with data being backed up into the cloud and restored onto on-premises hardware when a disaster occurs
    • In other words, the backup in the cloud becomes a substitute for tape-based off-site backups
    • Many backup software vendors now provide options to back up directly to popular cloud service providers such as AT&T, Amazon, Microsoft and Rackspace
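The trade-off between synchronous and asynchronous replication can be made concrete with two rough estimates. The sketch below is illustrative only: the function names are hypothetical, and it assumes light travels through optical fiber at roughly 200,000 km per second (about 5 microseconds per kilometre), ignoring switching, protocol and storage-array delays. Because a synchronous write waits for the remote acknowledgement, every write pays at least one round trip; for asynchronous replication, the data at risk is approximately the change rate multiplied by the replication lag.

```python
# Approximate propagation speed of light in optical fiber (an assumption
# used for a lower-bound estimate; real links add equipment delays).
FIBER_KM_PER_SECOND = 200_000.0


def round_trip_ms(distance_km):
    """Minimum round-trip propagation delay in milliseconds over fiber."""
    if distance_km < 0:
        raise ValueError("distance_km must not be negative")
    return 2 * distance_km / FIBER_KM_PER_SECOND * 1000


def data_at_risk_gb(change_rate_gb_per_hour, lag_hours):
    """Rough amount of unreplicated data if the primary fails (asynchronous)."""
    if change_rate_gb_per_hour < 0 or lag_hours < 0:
        raise ValueError("inputs must not be negative")
    return change_rate_gb_per_hour * lag_hours


if __name__ == "__main__":
    for km in (50, 200):
        print(f"{km} km: at least {round_trip_ms(km):.2f} ms added per synchronous write")
    print(f"5 GB/h of change with 2 h of lag: {data_at_risk_gb(5, 2):.0f} GB at risk")
```

Under these assumptions a 200 km separation adds at least 2 ms to every synchronous write, which helps explain why busy transactional systems tend to hit a practical distance limit well before the theoretical one.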
  • 19. 16 Disaster Recovery Site Selection
When assessing the type of backup recovery and replication architecture, one of the key components is the selection of the disaster recovery site. Drawing on leading industry best practices, the following recommendations provide guidance for selecting a disaster recovery data center site. In general, primary and backup sites should not be subject to the same threat profile (severe weather risks, same power grid, and flood zones).
    • Disaster Recovery sites should be located a significant distance [11] from the primary site
    • Proven practices suggest a minimum of 50 to 200 miles from the primary data center, though neither the SEC nor the FSA [12] mandates a specific distance
    • Leading Disaster Recovery practices indicate between 200 and 800 miles, provided there are no technical limitations imposed by solution architectures such as low latency / algorithmic trading, synchronous replication, and Fibre Channel distance limitations
    • Avoid flood-prone areas, major airport flight paths and earthquake areas, and ensure diversity of power feeds
    • Mitigate key man risk by ensuring labor pool resiliency (data center staff and application recovery resources) and creating appropriate documentation for cross-regional training
[Figure: Google Earth imagery, 2013. Blue/red pins mark data centers; red area, 0 to 25 miles; yellow area, 25 to 200 miles (marginal); green area, 200 to 800 miles.]
[11] 2003 SEC guidelines on Disaster Recovery (http://www.sec.gov/news/studies/34-47638.htm)
[12] FSA BCM guide (http://www.fsa.gov.uk/pubs/other/bcm_guide.pdf)
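A first-pass distance check between a primary site and a candidate recovery site can be sketched as follows. This is an illustrative sketch rather than a site selection tool: the function names and the coordinates are assumptions chosen for the example, the band thresholds follow the red, yellow and green areas in the figure, and straight-line distance says nothing about shared power grids, flood zones or network routes, which still need separate review.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def classify_separation(miles):
    """Map a site separation onto the distance bands shown in the figure."""
    if miles < 25:
        return "red"
    if miles < 200:
        return "yellow"  # marginal
    if miles <= 800:
        return "green"
    return "beyond mapped range"


if __name__ == "__main__":
    # Approximate city-center coordinates, used here only as an example.
    new_york = (40.7128, -74.0060)
    chicago = (41.8781, -87.6298)
    d = haversine_miles(*new_york, *chicago)
    print(f"New York to Chicago: {d:.0f} miles, band: {classify_separation(d)}")
```

A separation in the green band can still be unsuitable for synchronous replication or latency-sensitive trading, so the band is best read as a screening result, not a decision.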
  • 20. 17 Summary and Recommendations
Target Focus Areas
When performing an evaluation and assessment of IT critical infrastructure, certain issues should be addressed in order to properly frame and design a sound Business Recovery plan. The following interview questions can be used as a guide when assessing an environment:
    1. Can the IT infrastructure be trusted to withstand a major disruption?
    2. Has the resiliency of the Data Center, Network and Compute environment been proven?
    3. Has a Disaster Recovery test been performed recently? Were the critical business applications included in the last test? What were the results?
    4. Have the business requirements been mapped to the IT infrastructure via a top-down review?
    5. Does management fully understand the regulatory ramifications of not adhering to sound business recovery plans?
If these questions cannot be answered, the business may be at risk of failure because of its inability to recover production systems. Citihub would recommend an end-to-end assessment of IT infrastructure, along with an in-depth review of business continuity plans. A detailed infrastructure assessment of the Disaster Recovery plan and processes should include the following:
    • A thorough review of the existing primary and backup data centers, as well as the network and compute infrastructure, and the Disaster Recovery plan designs and architecture
    • An assessment of critical backup systems and confirmation that generator fuel pumps are not located in high-risk areas such as the basements of buildings in flood zones
    • A review of schedules for regular backup exercises and confirmation of failover procedures; confirmation that critical power has been tested and that generators are functioning with sufficient fuel levels
    • A review of regional and local FEMA flood zone maps (US), or the international equivalent, to determine the level of acceptable risk for data centers and critical systems
    • An understanding of fuel delivery schedules and the assurance that contracts are in place for emergency fuel delivery, taking into consideration that hospitals and emergency facilities have priority for fuel deliveries
    • A review of the backup data center location, making sure that the site is outside the primary geographic area and on separate utility grids if possible
    • The education of teams for preparedness, so that they act proactively and at the appropriate time (for example, not delaying the switch to backup power until the middle of an event)
    • An evaluation of service provider backup plans to identify dependency risks
    • An evaluation of remote access procedures and support systems; confirmation of sufficient capacity to support key staff working remotely
  • 21. 18 Best Practices
To help spearhead a Business Continuity Management plan and a Disaster Recovery program, the following best practices can drive awareness of the critical nature of these processes and help senior management establish new plans or revise existing ones and eliminate gaps.
    • Establish a planning group to develop resiliency designs and recovery strategies
    • Build management awareness by establishing Key Performance Indicators (KPI) for Disaster Recovery to include the following:
        - Status of previous Disaster Recovery events/tests with periodic reports to senior management
        - Other core IT competencies that are critical to Disaster Recovery planning
        - Periodic tests to verify implementation of the Disaster Recovery plan and reports about gaps and risks
        - A review process that includes the deployment of new solutions
    • Perform Risk Assessments and Audits that will:
        - Complete a top-down inventory assessment of all critical assets required to sustain operations
        - Review process structure assessments, audits, and reports
        - Assess gaps and risks from previous events or audits
        - Create an implementation plan to eliminate gaps
        - Document Disaster Recovery plan actions and escalation procedures
        - Build comprehensive training material
        - Develop test verification criteria and procedures
    • Separate people from technology and confirm which business processes require onsite staff to resume operations
    • Establish a real remote access strategy for staff who are unable to commute during severe weather conditions
  • 22. 19 Business Continuity Management Framework Source: Citihub Business Continuity Management and Disaster Recovery Framework
  • 23. 20 Elements of Business Recovery Planning
The business process assessment for determining critical areas of recovery begins with a top-down review as shown below. This approach confirms the technical infrastructure and dependencies associated with each business process.
[Figure: top-down business process review. Source: Citihub]
The above process enables end-to-end mapping of dependencies, which is critical to understanding the key components that make up an application system. In order to determine business unit IT needs, and provide a gap analysis against IT capabilities, Citihub has developed a business impact analysis methodology on critical processes and the IT systems which support them. The three areas of focus are:

Business Unit Overview
    The business unit overview and readiness heat map is used to capture business process criticality and IT capability readiness in the event of a catastrophic outage.

Process Summary
    The key process summary examines business processes and rates the impact of a sustained outage on the business on three dimensions: Operational Impact, Financial Impact and Reputational Impact.

Application Requirements Summary
    The application requirements gap analysis section summarizes the applications each business unit requires and provides a RAG status when compared against IT capabilities.

Source: Citihub Business Impact Analysis Methodology
  • 24. 21 FEMA Flood Maps
Appendix A illustrates one of the more critical vulnerabilities that exist within the New York metropolitan area. The storm surge during Hurricane Sandy [13], which caused major flooding in parts of the region, impacted critical systems in the core BMS and data center M&E, as well as transportation infrastructure in and out of New York City and the Tri-State area. The maps are ranked high to low by impact due to flooding and storm surge severity.

Rank: LOW
    Risk: No impact due to storm surge
    Impact: None
    Mitigation:
        • Ensure redundancy site is active and tested

Rank: MEDIUM
    Risk: Storm surge impact can occur but is unlikely
    Impact: Partial or no building damage and/or access to main entrance
    Mitigation:
        • Ensure redundancy site is active and tested
        • Recovery plans activated

Rank: HIGH
    Risk: Storm surge impact is severe
    Impact: Damage to main electrical switch gear and/or generators or fuel pumps
    Mitigation:
        • Ensure redundancy site is active and tested
        • Recovery plans activated
        • Staff plan activated

[13] http://www.nhc.noaa.gov/refresh/graphics_at3+shtml/030345.shtml?gm_esurge
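The rank table above lends itself to a simple lookup that a runbook or site inventory script could use to list the mitigation steps for a site's flood rank. The sketch below is a minimal illustration: the dictionary layout and the mitigation_for function are hypothetical names, and the risk, impact and mitigation text is copied from the table rather than derived from FEMA data.

```python
# Flood rank table from the paper, expressed as a lookup structure.
FLOOD_RANKS = {
    "LOW": {
        "risk": "No impact due to storm surge",
        "impact": "None",
        "mitigation": ["Ensure redundancy site is active and tested"],
    },
    "MEDIUM": {
        "risk": "Storm surge impact can occur but is unlikely",
        "impact": "Partial or no building damage and/or access to main entrance",
        "mitigation": [
            "Ensure redundancy site is active and tested",
            "Recovery plans activated",
        ],
    },
    "HIGH": {
        "risk": "Storm surge impact is severe",
        "impact": "Damage to main electrical switch gear and/or generators or fuel pumps",
        "mitigation": [
            "Ensure redundancy site is active and tested",
            "Recovery plans activated",
            "Staff plan activated",
        ],
    },
}


def mitigation_for(rank):
    """Return the mitigation steps for a flood rank (case-insensitive)."""
    key = rank.strip().upper()
    if key not in FLOOD_RANKS:
        raise KeyError(f"unknown flood rank: {rank!r}")
    return list(FLOOD_RANKS[key]["mitigation"])


if __name__ == "__main__":
    for step in mitigation_for("high"):
        print("-", step)
```

Each higher rank adds steps to the one below it, so a site that moves from MEDIUM to HIGH keeps its existing actions and gains the staff plan.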
  • 25. 22 Appendix A FEMA Flood Hazard Mapping - HIGH New York Locations: Lower Manhattan and 55 Water Street New York Locations: 25 Broadway and 32 Ave of the Americas New Jersey Locations: 410 Commerce Blvd. and 760 Washington Ave.
  • 26. 23 FEMA Flood Hazard Mapping - HIGH (cont’d) New Jersey Locations: 545 Washington Blvd. and 755 Secaucus Road New Jersey Locations: 15 Enterprise Ave. North and 300 Boulevard East
  • 27. 24 FEMA Flood Hazard Mapping - LOW New York Locations: 111 8th Ave. and 360 Hamilton Ave., White Plains New York Locations: 480 North Bedford Road, Chappaqua and 11 Skyline Drive, Hawthorne
  • 28. 25 FEMA Flood Hazard Mapping - LOW (cont’d) New Jersey Locations: 1400 Federal Blvd. and 3003 Woodbridge Ave. New Jersey Locations: 165 Halsey Street and 100 Delawanna Ave
  • 29. 26 FEMA Flood Hazard Mapping - LOW (cont’d) Chicago Locations: 350 East Cermak, Chicago, IL and 2905 Diehl Road, Aurora IL
  • 30. 27 Appendix B Natural Disaster Risk Profiles for Data Centers

Type: Tornado
    On-Site: In or near the storm path, expect disruption and minor to severe infrastructure damage
    Off-Site: In or near the storm path, expect disruption and minor to severe infrastructure damage
    Impact:
        • Advance warning of tornado potential but no site-specific warning
        • Employees remain at site
        • Duration is brief although intense
        • Roof and outside equipment (cooling towers, etc.) damaged or destroyed
        • Potential damage to the building structure
        • Loss of local utility and communications

Type: Hurricane
    On-Site: In or near the storm path, expect disruption and minor to severe infrastructure damage
    Off-Site: Expect severe region-wide damage to public infrastructure, utilities and communications
    Impact:
        • Significant advance warning
        • Duration is hours to a few days
        • Employees may require evacuation from site
        • Post-storm security may be required
        • Emergency supplies needed for at least several days
        • Roof and outside equipment (cooling towers, etc.) damaged or destroyed
        • Potential damage to the building structure
        • Loss of local utility and communications
        • Repair to regional damage may require days, weeks or longer for massive reconstruction of electric power transmission or distribution facilities
        • Potential for off-site public infrastructure damage

Type: Earthquake
    On-Site: Expect catastrophic damage and disruption to data centers near the epicenter and infrastructure damage to data centers further away
    Off-Site: Expect severe region-wide damage to public infrastructure, utilities and communications
    Impact:
        • No warning
        • Brief duration with the threat of continued aftershocks
        • Employees may be unable to leave site
        • Emergency supplies needed for several days of operation
        • Building structural damage
        • Toppling of unbraced computer hardware and site infrastructure equipment, including collapse of raised floor
        • Site may be isolated for an extended period
        • Highways and bridges may be damaged or destroyed, preventing movement of diesel fuel and other operating supplies required for continuous operation
        • Power and communications may sustain extensive damage requiring days, weeks or longer to repair

Source: Uptime Institute, Natural Disaster Risk Profiles for Data Centers, http://uptimeinstitute.com/publications
  • 31. 28 Natural Disaster Risk Profiles for Data Centers (cont'd)

Type: Ice Storm / Blizzard
    On-Site: Expect some disruption or failure of data center if outside equipment is not designed to survive severe ice and snow accumulation
    Off-Site: Expect severe region-wide damage to public infrastructure, utilities and communications
    Impact:
        • Several days' warning generally expected
        • Storm or multiple storms may last several days with cumulative effects
        • Employees may be unable to leave or enter site
        • Emergency supplies needed for at least several days
        • Ice damage to structure and outside equipment
        • Roof failure from excessive snow load
        • Potential freezing of pipes
        • Loss of overhead power and/or communications lines over large areas may require several days, weeks or longer to repair
        • Roads dangerous or impassable

Type: Thunderstorm / Lightning
    On-Site: Expect disruption ranging from disaster to no impact depending on distance to lightning strike and proper operation of surge suppression, UPS, and engine-generator systems
    Off-Site: Expect frequent momentary public utility disruptions from lightning strikes hitting the electric power transmission grid
    Impact:
        • Special sensors can provide minutes of storm approach warning
        • Duration is brief but may recur daily during thunderstorm season
        • Frequent UPS battery discharges shorten remaining battery life
        • Extended power interruption if utility service is overhead or radial and a nearby lightning strike causes protective devices to open
        • Possible flooding and roof leakage
        • Momentary undervoltages can affect hundreds of square miles
        • Fires started by lightning can destroy public infrastructure located in rural areas

Type: Flood
    On-Site: Expect catastrophic damage and disruption to data centers in severe flood areas or with infrastructure systems below grade
    Off-Site: Expect severe region-wide damage to public infrastructure, utilities and communications
    Impact:
        • Several days' warning generally expected
        • Employees may be unable to leave site
        • Emergency supplies needed for at least several days of operation
        • Site infrastructure damage requiring days to weeks to repair
        • Site may be isolated for an extended period
        • Highways and bridges may be damaged, preventing movement of diesel fuel and other operating supplies required for continuous operation
        • Power and communications may sustain extensive damage requiring days, weeks or longer to repair

Source: Uptime Institute, Natural Disaster Risk Profiles for Data Centers, http://uptimeinstitute.com/publications
  • 32. 29 Appendix C East Coast Liquidity Venues
The New York metro area is responsible for approximately 94% of the volume of shares traded for the US cash equity market [14]. The following maps illustrate the major liquidity venues in the New York and Chicago metropolitan locations.
[Maps: New York; Chicago]
[14] http://www.batstrading.com/market_data/daily_volume/
  • 33. 30 Works Cited & References
Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System, September 2005. http://www.sec.gov/news/studies/34-47638.htm
Dodd-Frank H.R. 4173 Wall Street Reform and Consumer Protection Act, January 2010
National Public Radio (NPR), Visualizing The U.S. Electric Grid, April 24th 2009, www.npr.org/templates/story/story.php?storyId=110997398
NOAA National Climatic Data Center, State of the Climate: National Overview for Annual 2012, published online December 2012 from http://www.ncdc.noaa.gov/sotc/national/2012/13.
Uptime Institute, Natural Disaster Risk Profiles for Data Centers, http://uptimeinstitute.com/publications
United States Geological Survey, http://earthquake.usgs.gov/earthquakes/states/new_york/hazards.php
BICSI Standards for Data Centers, https://www.bicsi.org/default.aspx
Colocation Selection, best practices and critical considerations for choosing the right data center colocation solution. Bill Kleyman, Cloud and Virtualization Architect, October 2012
Climate Change and Infrastructure, Urban Systems, and Vulnerabilities, Technical Report for the U.S. Department of Energy in Support of the National Climate Assessment, February 29, 2012
The historic nor'easter of 13-14 March 2010, Richard H. Grumm, National Weather Service
  • 34. 31 About Citihub
Founded in 1998, Citihub provides IT expertise to some of the world's leading enterprise organizations and is comprised of industry veterans who relish the challenge of complex technology and cultural change. We take a fresh approach to the technical challenges of today and believe in partnering with our clients through change. Citihub clients include Investment Banks, Hedge Funds, Media, and Manufacturing.

About the Authors

Vincent Pelly
Vincent Pelly is an Associate Partner at Citihub with more than 30 years of experience across the financial services industry with specialization in infrastructure, program management and IT strategy. He has extensive experience managing large enterprise projects in infrastructure and data center advisory and technology implementation, and has managed large infrastructure transformation programs.

Scott Haglund
Scott Haglund is an independent consultant with more than 30 years of experience in the development and execution of global infrastructure strategy, architecture, transformation, technology roadmaps, optimization, and service delivery standards for the enterprise. He specializes in data center automation strategies, and has led many enterprise infrastructure transformation programs.