Data Center Critical Infrastructure Risk and Vulnerabilities- Impact to Capital Markets

We at Citihub believe in the importance of having an end-to-end Business Continuity solution that includes not only a tested and validated data center and infrastructure design, but also the ability to provide staff with remote access to the key applications needed to continue operations.
Our recently published white paper provides senior executives with an overview of Disaster Recovery preparedness as well as outlining the potential risks and vulnerabilities that exist in critical infrastructure, specifically in the New York metropolitan area. Read the technical white paper “Data Center Infrastructure Risk and Vulnerabilities” to become aware of critical details that may not be covered in your business Disaster Recovery plans.

  1. An Information Technology Wake-up Call
     Disaster Recovery Planning
     Impact to Capital Markets Technology & Data Center Critical Infrastructure
     TECHNICAL WHITE PAPER
     Assess and Mitigate Risk and Vulnerabilities to Business Continuity and Disaster Recovery in the New York Metropolitan Area
     Vincent Pelly
     Scott Haglund
     Sophie Pascal, Contributing Editor
  2. Table of Contents
     Executive Summary
     What happens to Business when the lights go out?
     Intended Audience and Structure
     Keeping Business in Business
     The unanticipated hidden risks
     Lessons Learned
     Review of past events are key to effective Disaster Recovery
     Risks to the Critical Infrastructure
     Climate Conditions and Patterns
     Seismic Activity and Risk
     Electrical Distribution and the Power Grid
     Data Center Reliability Classification
     Best Practices
     For Infrastructure Design
     High Availability & Disaster Recovery
     RTO and RPO
     Expectations for Continuous Availability
     Virtualization
     Replication and Network Bandwidth
     Database Replication
     Types of Backup Recovery and Replication Architectures
     Disaster Recovery Site Selection
     Summary and Recommendations
     Best Practices
     Business Continuity Management Framework
     Elements of Business Recovery Planning
     FEMA Flood Maps
     Appendix A
     FEMA Flood Hazard Mapping - HIGH
     FEMA Flood Hazard Mapping - LOW
  3. Table of Contents (continued)
     Appendix B
     Natural Disaster Risk Profiles for Data Centers
     Appendix C
     East Coast Liquidity Venues
     Works Cited & References
     About Citihub
     About the Authors
  4. Executive Summary

     What happens to Business when the lights go out?

     In the aftermath of Hurricane Sandy, significant flooding in coastal areas left a majority of the Northeastern United States without commercial electricity. Many businesses lost power because their buildings were located in zones that were flooded with seawater and because the main electrical panels were located below the rising water level. Generators that supported data centers could not be supplied with fuel because the fuel pumps were located in flooded basements. Firms that had not pre-purchased fuel or secured delivery contracts for their backup generators were unable to operate their data centers beyond fuel storage capacity, and firms that did pre-purchase fuel could not receive deliveries due to flooded roadways. Employees were unable to access their offices, critical staff members were unable to travel to offsite recovery locations because government mandates forbade access to roadways for non-essential personnel, and customers were unable to complete online transactions. The overall impact of Hurricane Sandy was evaluated at between $30 billion and $50 billion.[1]

     The numerous failures of mission-critical IT infrastructure brought immediate attention to some very important design flaws in today's Recovery plans and processes. These flaws show that data center facilities are vulnerable, leaving Business exposed to outages it cannot afford.

     The objective of this white paper is to provide senior executives with an overview of Disaster Recovery preparedness as well as the potential risks and vulnerabilities that exist in critical infrastructure, specifically in the New York metropolitan area. It will also help senior executives become aware of critical details that may not be covered in their current Disaster Recovery plans.

     We at Citihub believe in the importance of having an end-to-end Business Continuity solution that includes not only a tested and validated data center and infrastructure design, but also the ability to provide staff with remote access to the key applications needed to continue operations.

     The recommendations listed in this white paper outline high-level frameworks designed for addressing business systems redundancy. The paper also demonstrates how to significantly reduce data loss by using various design principles and best practices to obtain the Disaster Recovery system that best supports Business requirements.

     Although the target industry is financial services, this paper can serve as a primary reference for building the appropriate Disaster Recovery solution for any company, regardless of industry or geography.

     Finally, this paper offers a long-term business case for addressing critical vulnerabilities, as well as factors that senior executives should take into consideration when setting priorities regarding critical infrastructure. Addressing these helps to ensure Business Continuity and prevent loss of revenue in the event of another major outage.

     [1] http://online.wsj.com/article/SB10001424052970204712904578092663774022062.html?mod=googlenews_wsj
  5. Intended Audience and Structure

     This white paper is intended to help senior management and senior-level executives of financial services institutions navigate the Business Continuity and Disaster Recovery landscape. It outlines successful implementation strategies and best practices, and assumes that readers have basic knowledge of networks and infrastructure, as well as awareness of the geographical specificity of their businesses.

     Citihub will examine how site selection, power, cooling, and inadequacies within the system recovery architecture can contribute to a data center's risk of downtime. The analysis will explore specific data center infrastructure vulnerabilities, and suggest recommendations and best practices that identify and remediate gaps within the infrastructure to minimize downtime and achieve the highest possible return on investment.
  6. Keeping Business in Business

     The unanticipated hidden risks

     The technological ecosystem supporting financial markets relies heavily on centralized data centers, infrastructure and communication networks as the core processing engines of capital markets. Uninterrupted operations are critical to the daily business of the financial services industry, serving e-commerce, market data and pricing, matching engines, settlements and other critical systems, transactions and data that enable sell-side and buy-side firms to maintain worldwide market liquidity.

     Firms are at risk when disruption to the IT infrastructure occurs: systems are down and information is unavailable, adversely impacting business operations. Financial market participants, including retail banks and institutional securities firms, require reliable and consistent operations to support front and back office systems. This is particularly true of settlement and clearing firms that process open transactions and communications with customers, counterparties and third parties. Disruptions to daily operations can prevent financial institutions from managing liquidity, which can increase financial risk to their organizations.

     These are some of the business and technical drivers behind the design and implementation of robust Disaster Recovery plans. They should be given priority when selecting proper backup sites and developing sound Recovery management processes.

     Examples of system outages that should be considered when designing business and system resiliency plans:
     • Isolated failures caused by software or hardware errors, or by recent system upgrades that were not fully tested
     • External outages to telecommunications and electrical feeds caused by inadvertent damage to primary lines
     • Loss of critical infrastructure and mechanical and electrical systems, as well as failure of backup systems to provide continuous operations
     • Widespread outages caused by natural disasters and catastrophic events

     Immediate threats and consequences of not having a Disaster Recovery plan:
     • Loss of revenue and of customer confidence, and damage to the corporate brand and reputation, can arise from the inability of clients to access systems and account information or execute transactions
     • Cost to restore operations to normal state; without proper planning and Disaster Recovery management, this can be expensive
     • Potential fines or fees can be imposed for non-compliance when inadequate resiliency plans result in extended outages[2]

     [2] Dodd-Frank H.R. 4173 – 316 "(ii) establish and maintain emergency procedures, backup facilities, and a plan for disaster recovery"
  7. Lessons Learned

     Review of past events is key to effective Disaster Recovery

     Business today has not fully internalized the significant findings of the interagency paper, discussed below, that dates from almost ten years ago.

     During the past 12 years the East Coast of the United States, in particular the Northeast and the New York metropolitan area, has experienced several widespread power outages related to extreme weather conditions that have greatly impacted technology infrastructure. These events confirm that our critical IT infrastructure is vulnerable to regional disruption (power outages, climate change and natural disasters), as demonstrated by the increase in wide-scale and regional disruption over the past decade.

     In response to these events, IT executives have revised Business Continuity plans and introduced alternative backup sites, such as tertiary sites in geographical regions outside the location of the primary corporate site. Within the financial services community, senior industry leaders along with the Federal Reserve Board, OCC and SEC issued in 2005 an interagency white paper[3] that described best practices to strengthen the resiliency of U.S. financial services post 9/11. The paper stressed the critical importance of protecting the financial system from new risks associated with widespread outages by focusing on the following high-level Business Continuity objectives:
     • Rapid recovery and timely resumption of critical operations
     • Key staff to resume critical operations in one major operating location
     • Comprehensive testing that demonstrates effective internal and external continuity arrangements

     "Firms that play significant roles in critical financial markets should maintain sufficient geographically dispersed resources, including staff, equipment and data to recover clearing and settlement activities within the business day on which a disruption occurs. Firms may consider the costs and benefits of a variety of approaches that ensure rapid recovery from a wide-scale disruption. However, if a backup site relies largely on staff from the primary site, it is critical for the firm to determine how staffing needs at the backup site would be met if a disruption results in loss or inaccessibility of staff at the primary site."
     - Federal Reserve White Paper on the Resiliency of the U.S. Financial System, 2005

     [3] Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System, September 2005, www.sec.gov/rules/concept/34-46432.htm
  8. The results of the Federal interagency white paper, as well as the analyses and discussions held with financial industry technology experts and practitioners, show that sound practices based on the above key points have resulted in the development and implementation of best practices regarding Business Continuity. It is understood industry-wide that many firms at the time did not embrace the urgency of the report, mostly because of cost considerations. Today that urgency can no longer be ignored.

     On the strength of that interagency paper, and in reviewing past and recent events, it is imperative that these key points be taken into consideration when designing and building the Disaster Recovery architecture:
     • Perform a top-down assessment of critical business activities, mapped to supporting IT systems and key staff members.
     • Prioritize the systems to recover first, and assign the required support staff for a potentially limited capability in recovery mode.
     • Establish a crisis management team that will coordinate activities and make prioritization calls on the ground. Critical time is often lost in the decision-making process to invoke a Disaster Recovery plan.
     • Have a solid Recovery plan built around established backup site(s), separate from the core processing location, for data centers and all key business staff.
     • Periodically test backup systems and network connectivity, and perform application role swaps on a scheduled basis to ensure Recovery plans function properly.

     Comprehensive Disaster Recovery testing should be end-to-end and involve telecommunication firms, third-party service providers and securities exchanges, as well as vetting of the business process and the proper activation sequence for application systems. It should also serve to familiarize business users with operational procedures in unusual situations.
  9. Risks to the Critical Infrastructure

     Climate Conditions and Patterns

     "NOAA estimated approximately $1 billion in damage that occurred in 2011 from 12-14 major events"[4]
     - NOAA 2012

     A significant concern when reviewing an organization's primary and recovery sites is their geographic vulnerability to severe weather. Tools and resources available from FEMA and the National Oceanic and Atmospheric Administration[5], together with historical weather patterns, can provide data on locations that have had consistent damage due to severe weather.

     Below is a summary of the NOAA 2011 and 2012 National Events Map for the U.S.

     [Figure: Significant U.S. Weather and Climate Events, 2011 and 2012]

     The summary of risk profiles in Appendix B, drawn from the Uptime Institute Natural Disaster Risk Profiles[6], outlines the risks to data center sites associated geographically with severe weather. A data center in or near the storm path should expect disruption, as well as minor to severe infrastructure damage, when subject to the following natural disasters:
     • Tornado
     • Hurricane
     • Earthquake
     • Ice Storm
     • Blizzard
     • Thunderstorm
     • Lightning
     • Flood

     For detailed FEMA flood maps of the New York metro area, please refer to Appendix A, which also lists the primary locations of critical data centers serving financial services.

     [4] http://www.noaa.gov/extreme2011/index.html
     [5] NOAA National Climatic Data Center, State of the Climate: National Overview for Annual 2012, published online December 2012 from http://www.ncdc.noaa.gov/sotc/national/2012/13.
     [6] Uptime Institute, Natural Disaster Risk Profiles for Data Centers, http://uptimeinstitute.com/publications
  10. Seismic Activity and Risk

     Historically, earthquakes and seismic activity are a rare occurrence in the New York metropolitan area, with the exception of the 2011 Virginia Earthquake[7] that produced tremors throughout the New York area. Although no damage or outages occurred during the 2011 event, it is a best practice to evaluate seismic activity when selecting a primary and recovery site.

     The following graphs summarize the historical impact, magnitude and spread of seismic activity for the U.S. and the New York area.[8]

     [Figure: U.S. & New York Area Seismic Hazard Map. Source: USGS]

     [7] http://en.wikipedia.org/wiki/2011_Virginia_earthquake
     [8] United States Geological Survey, http://earthquake.usgs.gov/earthquakes/states/new_york/hazards.php
  11. Electrical Distribution and the Power Grid

     When planning for alternative Disaster Recovery backup locations, as well as when performing a risk and vulnerabilities assessment on the primary site, another key area of concern is the location of the power utilities and the major interconnections of the power grid. This type of assessment becomes critical when planning for 2N[9] redundancy for primary and secondary locations. In order to lower the risk of localized power outages, a full disclosure of the locations of power stations, substations and feeds to the facility, as well as the redundancy within the feeds, is necessary to determine where electrical power gaps may exist.

     [Figure: U.S. Electrical Grid and Power Plants. Source: NPR]

     "The U.S. electric grid is a complex network of independently owned and operated power plants and transmission lines."
     - NPR, Visualizing The U.S. Electric Grid

     [9] When referring to the data center utility feed, a 2N system contains double the capacity needed, provided by two systems that run separately with no single points of failure.
  12. Data Center Components

     A data center contains a number of critical systems that control and run the electrical and mechanical components necessary for successful operation. Many of these systems are tied into the Building Management Systems (BMS), and others are directly linked to IT monitoring systems. Within the past two years, the industry has taken the stance that both BMS and IT critical systems should be managed and monitored by a single system and reported via a holistic dashboard. These systems are part of the building envelope, and each contains a set of core delivery mechanisms and risk profiles.

     During Hurricane Sandy, many of the critical systems, specifically the mechanical and electrical (M&E) systems, were severely damaged because storm surge water entered the basements and took down main electrical panels, water and fuel pumps, and other equipment. Many data centers with generator fuel pumps located in basements had difficulty starting up backup generators, and in some cases fuel had to be manually delivered to generators located on higher floors (via a bucket brigade).

     In addition to the M&E systems, other common infrastructure dependencies required to maintain operations during a recovery period are generally related to the operations of the telecommunications infrastructure. During a widespread outage it is critical that the telecommunications infrastructure remain intact across the United States. Firms can mitigate this risk by implementing resiliency through the use of circuit diversity and routing when establishing geographically dispersed facilities.

     [Figure. Source: Citihub]
  13. Data Center Reliability Classification

     Several data center industry experts have defined reliability classifications for the data center infrastructure. The term reliability covers a variety of subjects, including availability, durability and quality, that describe how the data center has been engineered. The following five performance-based classes have been defined to classify the reliability of the data center, based on the Building Industry Consulting Services International (BICSI) standard for IT systems.[10]

     Class F0: Single Path without Alternate Power Source
     • Supports the basic environmental and energy requirements of the IT functions without supplementary equipment
     • Capital cost avoidance is the major driver
     • There is a high risk of downtime due to planned and unplanned events
     • Maintenance is performed during non-scheduled hours, and downtime of several hours or even days has minimal impact on the mission
     • A critical power distribution system separate from the general-use power systems would not exist
     • No backup generator system
     • The system might deploy power conditioning or surge protective devices to allow the specific equipment to function adequately (utility-grade power does not meet the basic requirements of critical equipment)
     • No redundancy for power or air conditioning

     Class F1: Single Path
     • Supports the basic environmental and energy requirements of the IT functions
     • There is a high risk of downtime due to planned and unplanned events
     • Maintenance can be performed during non-scheduled hours, and the impact of downtime is relatively low
     • The critical power distribution system would deploy a power conditioning device to allow the critical equipment to function adequately (utility-grade power does not meet the basic requirements of critical equipment)
     • No redundancy of any kind would be used for power or air conditioning for a similar reason

     Class F2: Single Path with Redundant Components
     • Provides a level of reliability higher than Class F1 to reduce the risk of downtime due to component failure
     • There is a moderate risk of downtime due to planned and unplanned events
     • Maintenance activities can typically be performed during unscheduled hours
     • The critical power system would need redundancy in those parts of the electrical distribution system that are most likely to fail
     • These would include any products that have a high parts count or moving parts, such as UPS, controls, air conditioning, generators or ATS
     • In addition, it may be appropriate to specify premium quality devices that provide longer life or better reliability

     Class F3: Concurrently Maintainable
     • Provides additional reliability and maintainability to reduce the risk of downtime due to natural disasters, human-driven disasters, planned maintenance, and repair activities
     • Maintenance and repair activities will typically need to be performed during full production time with no opportunity for curtailed operations
     • The critical power system must provide reliable, continuous power even when major components (or, where necessary, major subsystems) are out of service for repair or maintenance
     • To protect against unplanned downtime, the power system must be able to sustain operations while a dependent component or subsystem is out of service

     Class F4: Fault Tolerant
     • Eliminates downtime through the application of all tactics to provide continuous operation regardless of planned or unplanned activities
     • All recognizable single points of failure, from the point of connection to the utility to the point of connection to the critical loads, are eliminated
     • Systems are typically automated to reduce the chances for human error and are staffed 24×7
     • Rigorous training is provided for the staff to handle any contingency
     • Compartmentalization and fault tolerance are prime requirements
     • The critical power system must provide reliable, continuous power even when major components (or, where necessary, major subsystems) are out of service for repair or maintenance
     • To protect against unplanned downtime, the power system must be able to sustain operations while a dependent component or subsystem is out of service

     [10] BICSI Standards for Data Centers, https://www.bicsi.org/default.aspx
  14. Best Practices

     For Infrastructure Design

     High Availability & Disaster Recovery

     High Availability (HA) and Disaster Recovery (DR) are both concepts related to Business Continuity. But whereas Business Continuity applies to the whole business (including IT), HA and DR are typically more related to IT Continuity, as part of overall Business Continuity. High Availability solutions mainly address outages at a single site, while Disaster Recovery solutions mainly address sudden, site-wide disasters. High Availability and Disaster Recovery objectives and metrics are different.

     A highly available site provides resiliency against errors of the underlying platform and single points of failure. Availability encompasses reliability, recovery, and failure. One of the most common measures of availability is the percentage of time that a given system is active and working. The following table correlates the percentage of availability to calendar time equivalents.

     Acceptable Uptime   Downtime Per Day   Downtime Per Month   Downtime Per Year
     99%                 14.40 minutes      7 hours              3.65 days
     99.9%               86.40 seconds      43 minutes           8.77 hours
     99.99%              8.64 seconds       4 minutes            52.60 minutes
     99.999%             0.86 seconds       26 seconds           5.26 minutes

     RTO and RPO

     The Recovery Time Objective (RTO) is the elapsed time from service interruption until service is restored. It answers the question: "How long can you be without service?" RTO represents a time limit that cannot be exceeded without facing severe consequences. A unified High Availability and Disaster Recovery approach would establish both an uptime objective and an RTO for each service.

     The Recovery Point Objective (RPO), on the other hand, is the point in time to which the data is restored when service resumes. It answers the question: "How old can the data be?"
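
The calendar equivalents in the uptime table above follow from simple arithmetic. The sketch below is one way to reproduce them; the function name and the choice of a 365.25-day year (with a month taken as one twelfth of a year) are assumptions made for illustration, not something the paper specifies.

```python
# Convert an availability percentage into a downtime budget.
# Assumptions (not stated in the paper): a year has 365.25 days and a
# month is one twelfth of a year.

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25


def allowed_downtime_seconds(availability_percent: float) -> dict:
    """Return the allowed downtime in seconds per day, month, and year."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    per_day = unavailable_fraction * SECONDS_PER_DAY
    per_year = per_day * DAYS_PER_YEAR
    return {"day": per_day, "month": per_year / 12, "year": per_year}


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        budget = allowed_downtime_seconds(pct)
        print(
            f"{pct}%: {budget['day']:.2f} s/day, "
            f"{budget['month'] / 60:.1f} min/month, "
            f"{budget['year'] / 3600:.2f} h/year"
        )
```

Under these assumptions, 99% availability works out to 864 seconds (14.4 minutes) per day, which matches the first row of the table; the table's monthly figures appear to be rounded.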
  15. Expectations for Continuous Availability

     Data Replication

     The two basic methods of data replication are synchronous and asynchronous. In general terms, synchronous capabilities are used for shorter distances, and asynchronous capabilities are used for longer distances. The method chosen depends on Business Recovery requirements.

     Synchronous replication ensures that a remote copy of the data, identical to the primary copy, is created at the time the primary copy is updated. In synchronous replication, an update operation is not considered done until completion is confirmed at both the primary and secondary sites. An incomplete operation is rolled back at both locations, ensuring that the remote copy is always an exact mirror image of the primary.

     Asynchronous replication places data updates in a queue on the primary server and does not wait for update acknowledgments from the secondary server. As a result, any updates that have not yet been copied across the network to the secondary server are lost if the primary server fails, so application data may be lost in this type of failure.

     Most companies cannot tolerate more than a few hours or even minutes of downtime without serious impact to the bottom line. Synchronous data replication may be the appropriate solution for companies seeking the fastest possible data recovery, minimal data loss, and protection against database integrity problems.
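
The difference between the two replication methods described above can be illustrated with a toy model. This is a minimal sketch, not a real replication product: the `ReplicatedStore` class and its methods are hypothetical names invented for this example, the "sites" are in-memory lists, and rollback of incomplete synchronous operations is not modeled.

```python
# Toy model of synchronous vs. asynchronous replication (illustrative only).
from collections import deque


class ReplicatedStore:
    """Hypothetical two-site store; both sites are plain in-memory lists."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []     # writes committed at the primary site
        self.secondary = []   # writes applied at the secondary site
        self.queue = deque()  # asynchronous mode: updates not yet shipped

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            # The write completes only once both sites hold the record.
            self.secondary.append(record)
        else:
            # The write completes immediately; the update is shipped later.
            self.queue.append(record)

    def ship(self, max_records: int):
        """Copy up to max_records queued updates to the secondary site."""
        for _ in range(min(max_records, len(self.queue))):
            self.secondary.append(self.queue.popleft())

    def lost_on_primary_failure(self):
        """Records that would be missing if the primary failed right now."""
        return self.primary[len(self.secondary):]


if __name__ == "__main__":
    sync_store = ReplicatedStore(synchronous=True)
    async_store = ReplicatedStore(synchronous=False)
    for i in range(5):
        sync_store.write(i)
        async_store.write(i)
    async_store.ship(3)  # only three updates crossed the network in time
    print(sync_store.lost_on_primary_failure())   # []
    print(async_store.lost_on_primary_failure())  # [3, 4]
```

In this model the size of the unshipped queue at the moment of failure is the data at risk, which is what the RPO for an asynchronously replicated service has to account for.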
  16. Virtualization

     Virtualization makes it possible to implement Disaster Recovery plans at a significantly lower cost. Since virtual machines are hardware-independent, any physical server can be used as a recovery target for any virtual machine. As virtualization also makes it possible to consolidate workloads onto fewer servers, organizations can significantly reduce the cost of hardware for Disaster Recovery by reducing the number of servers needed at the primary site.

     Many organizations have already embraced the benefits of virtualization, as it can add tremendous value to Disaster Recovery planning. Before virtualization, Disaster Recovery was often too expensive to implement, and many organizations chose to protect only the most critical applications. Consolidating multiple physical servers as virtual hosts significantly reduces the number of physical servers that need to be recovered in the event of an outage.
  17. Replication and Network Bandwidth

     Network bandwidth can also introduce challenges to data replication strategies. It is important to understand the amount of changed data that can occur within a given period of time. This period of time is referred to as the replication latency window. Depending on the rate of changed data in a given system, one can determine the amount of bandwidth needed. The network bandwidth guideline below can assist with these calculations.

     Database Replication

     Database replication is similar to database mirroring. These solutions use production database transaction logs to maintain a current copy of the production database on a standby server. In the event of a server outage, the database replication software automatically promotes the standby database to become the production database. There are traditionally no restrictions on where the databases can reside, provided that they can communicate with each other.

     Synchronous replication, however, does have some drawbacks. It has a theoretical distance limitation of 200 kilometres (km) or 124 miles, but the practical distance limitation for a busy system could be as little as 50 km or 30 miles.

     Estimated Hours To Replicate Capacity

     Network           20 GB    80 GB   120 GB   200 GB   300 GB    730 GB
     T1                42.33   169.31   253.97   423.28   634.92   1544.97
     10Base-T LAN       6.50    26.01    39.01    65.02    97.52    237.31
     DS3 / T3           1.50     6.02     9.03    15.05    22.57     54.93
     100Base-T LAN      0.65     2.60     3.90     6.50     9.75     23.73
     OC3                0.42     1.68     2.52     4.19     6.29     15.31
     OC12               0.10     0.42     0.63     1.05     1.57      3.82
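
The relationship between data volume, link speed and replication time in the table above can be estimated with a short calculation. The sketch below is an idealized approximation under stated assumptions: the function names are hypothetical, gigabytes are treated as decimal (10^9 bytes), the link rates are nominal line rates, and the `efficiency` parameter is a placeholder for protocol overhead that the paper does not quantify. At full line rate the results come out lower than the table's figures, which presumably include some overhead.

```python
# Idealized replication-time and bandwidth estimates (illustrative only).
# Assumptions: decimal gigabytes (1 GB = 1e9 bytes), nominal line rates,
# and an "efficiency" factor standing in for unquantified overhead.

LINK_MBPS = {
    "T1": 1.544,
    "10Base-T LAN": 10.0,
    "DS3 / T3": 44.736,
    "100Base-T LAN": 100.0,
    "OC3": 155.52,
    "OC12": 622.08,
}


def hours_to_replicate(gigabytes: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to copy the given volume over the given link."""
    if gigabytes < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid volume, link rate, or efficiency")
    bits = gigabytes * 1e9 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 3600


def required_mbps(changed_gigabytes: float, window_hours: float) -> float:
    """Sustained bandwidth needed to ship the changed data within the window."""
    if changed_gigabytes < 0 or window_hours <= 0:
        raise ValueError("invalid volume or window")
    return changed_gigabytes * 1e9 * 8 / (window_hours * 3600) / 1e6


if __name__ == "__main__":
    for name, mbps in LINK_MBPS.items():
        print(f"{name}: {hours_to_replicate(20, mbps):.2f} h for 20 GB at line rate")
    print(f"80 GB changed in an 8-hour window needs {required_mbps(80, 8):.2f} Mbps")
```

For example, 20 GB over a T1 at full line rate takes roughly 28.8 hours in this model, and shipping 80 GB of changed data within an 8-hour replication latency window needs about 22 Mbps of sustained throughput.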
  18. 18. 15 Types of Backup Recovery and Replication Architectures
Choosing the best-suited backup and recovery option for an organization can be challenging. Traditionally, businesses request little to no downtime when recovering from a disaster or other type of outage. Implementing these types of solutions may represent a sizable investment. Management will have to decide which recovery option best fits the organization’s needs, particularly in relation to risk assessment, compliance and other requirements, as outlined earlier in this paper.

Single Site Backup and Recovery
• Backups and snapshots required for off-site storage must be created periodically
• Data can only be as up-to-date as the last backup, whether daily, weekly or monthly
• Recovery is limited to the point in time of the last backup

Multi-Site Asynchronous Data Replication
• Asynchronous replication is supported by disk arrays, networks and host-based replication products
• Changes to data are committed to the source first, then buffered or journaled and sent to the replication target(s)
• It is designed to work over long distances and greatly reduces bandwidth requirements
• This can introduce delays ranging from near-instantaneous to several hours, depending on network latency
• There is also no guarantee that the secondary system will have the most recent copy of the data if the primary fails

Multi-Site Synchronous Data Replication
• Used primarily for high-end transactional applications that require instantaneous failover if the primary node fails
• With synchronous replication, data is written to the primary and secondary storage systems at the same time, and a write is not complete until it is acknowledged by both local and remote storage systems
• Synchronous replication requires considerable bandwidth, which also makes it more expensive

Cloud Backup and Recovery
• Applications and data remain on-premises in this approach, with data being backed up into the cloud and restored onto on-premises hardware when a disaster occurs
• In other words, the backup in the cloud becomes a substitute for tape-based off-site backups
• Many backup software vendors now provide options to back up directly to popular cloud service providers such as AT&T, Amazon, Microsoft and Rackspace
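Because a synchronous write is not complete until the remote storage system acknowledges it, distance adds directly to write latency. The sketch below estimates that penalty for the 200 km and 50 km distances quoted earlier in the paper. It is an illustration under stated assumptions: signals travel at roughly 200,000 km/s in optical fiber (about 200 km per millisecond), each write costs one round trip, and the function names and the 0.5 ms local write time are hypothetical. Real links add switching, protocol and storage array delays on top.

```python
FIBER_KM_PER_MS = 200.0  # assumption: ~200,000 km/s propagation speed in fiber


def sync_round_trip_ms(distance_km: float, round_trips: int = 1) -> float:
    """Propagation delay added to each synchronous write, in milliseconds."""
    if distance_km < 0 or round_trips < 1:
        raise ValueError("distance must be >= 0 and round_trips >= 1")
    one_way_ms = distance_km / FIBER_KM_PER_MS
    return 2.0 * one_way_ms * round_trips


def max_serial_writes_per_second(distance_km: float, local_write_ms: float = 0.5) -> float:
    """Upper bound on strictly sequential writes, each waiting for the remote ack.

    local_write_ms is a hypothetical local commit time, not a measured value.
    """
    total_ms = local_write_ms + sync_round_trip_ms(distance_km)
    return 1000.0 / total_ms


if __name__ == "__main__":
    for km in (50, 200):
        print(f"{km} km: +{sync_round_trip_ms(km):.1f} ms per write, "
              f"about {max_serial_writes_per_second(km):.0f} serial writes/s")
```

The bound applies to a single stream of dependent writes; systems that issue many writes in parallel are less sensitive to the delay, which is one reason the practical distance limit varies with how busy and how serialized the workload is.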
  19. 19. 16 Disaster Recovery Site Selection
When assessing the type of backup recovery and replication architecture, one of the key components is the selection of the disaster recovery site. Drawing on leading industry best practices, the following recommendations provide guidance for selecting a disaster recovery data center site. In general, primary and backup sites should not be subject to the same threat profile (severe weather risks, same power grid, and flood zones).
• Disaster Recovery sites should be located a significant distance¹¹ from the primary site
• Proven practices suggest a minimum of 50 to 200 miles from the primary data center, though neither the SEC nor the FSA¹² mandates a specific distance
• Leading Disaster Recovery practices indicate between 200 and 800 miles, provided there are no technical limitations imposed by solution architectures such as low latency / algorithmic trading, synchronous replication, and fiber channel distance limitations
• Avoid flood-prone areas, major airport flight paths and earthquake areas, and ensure diversity of power feeds
• Mitigate key man risk by ensuring labor pool resiliency (data center staff and application recovery resources) and creating appropriate documentation for cross-regional training

¹¹ 2003 SEC guidelines on Disaster Recovery (http://www.sec.gov/news/studies/34-47638.htm)
¹² FSA BCM guide (http://www.fsa.gov.uk/pubs/other/bcm_guide.pdf)

Figure: Google Earth Imagery 2013. Blue/Red pins (data centers); Red area (0-25 miles); Yellow area (25-200 miles), marginal; Green area (200-800 miles).
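The distance bands in the figure can be checked for any candidate pair of sites with a great-circle calculation. The following is a minimal sketch: the band edges (25, 200 and 800 miles) come from the figure caption, while the function names, the handling of boundary values, the label for distances beyond 800 miles, and the sample coordinates (approximate city centers for New York and Chicago, not the addresses of any data center in this paper) are assumptions for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def great_circle_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def distance_band(miles: float) -> str:
    """Map a primary-to-DR distance onto the color bands used in the figure."""
    if miles < 25:
        return "red"
    if miles < 200:
        return "yellow (marginal)"
    if miles <= 800:
        return "green"
    return "beyond the 800-mile band"  # not covered by the figure


if __name__ == "__main__":
    # Approximate city-center coordinates (illustrative only)
    new_york = (40.7128, -74.0060)
    chicago = (41.8781, -87.6298)
    miles = great_circle_miles(*new_york, *chicago)
    print(f"{miles:.0f} miles -> {distance_band(miles)}")
```

Straight-line distance is only a first filter. Two sites in the green band can still share a power grid, a carrier route or a weather system, so the threat-profile checks in the list above still apply.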
  20. 20. 17 Summary and Recommendations
Target Focus Areas
When performing an evaluation and assessment of IT critical infrastructure, certain issues should be addressed in order to properly frame and design a sound Business Recovery plan. The following interview questions can be used as a guide when assessing an environment:
1. Can the IT infrastructure be trusted to withstand a major disruption?
2. Has the resiliency of the Data Center, Network and Compute environment been proven?
3. Has a Disaster Recovery test been performed recently? Were the critical business applications included in the last test? What were the results?
4. Have the business requirements been mapped to the IT infrastructure via a top-down review?
5. Does management fully understand the regulatory ramifications of not adhering to sound business recovery plans?
If those questions cannot be answered, then the business may be at risk of failure because of its inability to recover production systems.
Citihub would recommend an end-to-end assessment of IT infrastructure, along with an in-depth review of business continuity plans.
A detailed infrastructure assessment of the Disaster Recovery plan and processes should include the following:
• A thorough review of the existing primary and backup data centers, as well as the network and compute infrastructure, and the Disaster Recovery plan designs and architecture
• An assessment of critical backup systems and confirmation that generator fuel pumps are not located in high-risk areas such as building basements in flood zones
• A review of schedules for regular backup exercises and confirmation of failover procedures; confirmation that critical power has been tested and generators are functioning with sufficient fuel levels
• A review of regional and local FEMA flood zone maps (US), or the international equivalent, to determine the level of acceptable risk for data centers and critical systems
• An understanding of fuel delivery schedules and the assurance that contracts are in place for emergency fuel delivery, taking into consideration that hospitals and emergency facilities have priority for fuel deliveries
• A review of the backup data center location, making sure that the site is outside the primary geographic area and on separate utility grids if possible
• The education of teams for preparedness, so they act proactively and at the appropriate time (for example, not delaying the switch to backup power until the middle of an event)
• An evaluation of service provider backup plans to identify dependency risks
• An evaluation of remote access procedures and support systems; confirmation of sufficient capacity to support key staff working remotely
  21. 21. 18 Best Practices
To help spearhead a Business Continuity Management plan and a Disaster Recovery program, the following best practices can drive awareness of the critical nature of these processes as well as help senior management establish or revise existing plans and eliminate gaps.
• Establish a planning group to develop resiliency designs and recovery strategies
• Build management awareness by establishing Key Performance Indicators (KPI) for Disaster Recovery to include the following:
- Status of previous Disaster Recovery events/tests with periodic reports to senior management
- Other core IT competencies that are critical to Disaster Recovery planning
- Periodic tests to verify implementation of the Disaster Recovery plan and reports about gaps and risks
- A review process that includes the deployment of new solutions
• Perform Risk Assessments and Audits that will:
- Complete a top-down inventory assessment of all critical assets required to sustain operations
- Review process structure assessments, audits, and reports
- Assess gaps and risks from previous events or audits
- Create an implementation plan to eliminate gaps
- Document Disaster Recovery plan actions and escalation procedures
- Build comprehensive training material
- Develop test verification criteria and procedures
• Separate people from technology and confirm which business processes require onsite staff to resume operations
• Establish a real remote access strategy for staff who are unable to commute during severe weather conditions
  22. 22. 19 Business Continuity Management Framework
Source: Citihub Business Continuity Management and Disaster Recovery Framework
  23. 23. 20 Elements of Business Recovery Planning
The business process assessment for determining critical areas of recovery begins with a top-down review as shown below. This approach confirms the technical infrastructure and dependencies associated with each business process.
Source: Citihub
The above process enables end-to-end mapping of dependencies critical to providing an understanding of the key components that make up an application system. In order to determine business unit IT needs, and provide a gap analysis against IT capabilities, Citihub has developed a business impact analysis methodology on critical processes and the IT systems which support them.
The three areas of focus are:

Business Unit Overview
The business unit overview and readiness heat map is used to capture business process criticality and IT capability readiness in the event of a catastrophic outage.

Process Summary
The key process summary examines business processes and rates the impact of a sustained outage on the business on three dimensions: Operational Impact, Financial Impact and Reputational Impact.

Application Requirements Summary
The application requirements gap analysis section summarizes the applications each business unit requires and provides a RAG status when compared against IT capabilities.

Source: Citihub Business Impact Analysis Methodology
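One way to picture the application requirements gap analysis is as a comparison of what each business unit needs against what IT can currently deliver, reduced to a red/amber/green (RAG) status. The sketch below is hypothetical and is not Citihub's methodology: it assumes the comparison is made on recovery time (hours to restore service), and the class name, field names, application names and the 1.5x amber tolerance are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApplicationRequirement:
    """One row of a hypothetical gap analysis (all fields are illustrative)."""
    name: str
    required_recovery_hours: float    # what the business unit says it needs
    capability_recovery_hours: float  # what IT can demonstrably deliver


def rag_status(app: ApplicationRequirement, amber_tolerance: float = 1.5) -> str:
    """Return GREEN, AMBER or RED for one application.

    GREEN: capability meets the requirement.
    AMBER: capability misses it by no more than the assumed tolerance factor.
    RED:   capability misses it by more than that.
    """
    if app.capability_recovery_hours <= app.required_recovery_hours:
        return "GREEN"
    if app.capability_recovery_hours <= app.required_recovery_hours * amber_tolerance:
        return "AMBER"
    return "RED"


if __name__ == "__main__":
    portfolio = [
        ApplicationRequirement("order-management", 4, 2),
        ApplicationRequirement("settlement", 4, 5),
        ApplicationRequirement("reporting", 24, 72),
    ]
    for app in portfolio:
        print(f"{app.name:18s} {rag_status(app)}")
```

A real assessment would rate several dimensions per application and weigh them by the operational, financial and reputational impact scores from the process summary; the single-metric version here only shows the shape of the comparison.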
  24. 24. 21 FEMA Flood Maps
Appendix A illustrates one of the more critical vulnerabilities that exist within the New York metropolitan area. The storm surge during Hurricane Sandy¹³, which caused major flooding in parts of the region, impacted critical systems in the core BMS and data center M&E, as well as transportation infrastructure in and out of New York City and the Tri-State area.
The maps are ranked high to low by impact due to flooding and storm surge severity.

Rank: LOW
Risk: No impact due to storm surge
Impact: None
Mitigation:
• Ensure redundancy site is active and tested

Rank: MEDIUM
Risk: Storm surge impact can occur but is unlikely
Impact: Partial or no building damage and/or access to main entrance
Mitigation:
• Ensure redundancy site is active and tested
• Recovery plans activated

Rank: HIGH
Risk: Storm surge impact is severe
Impact: Damage to main electrical switch gear and/or generators or fuel pumps
Mitigation:
• Ensure redundancy site is active and tested
• Recovery plans activated
• Staff plan activated

¹³ http://www.nhc.noaa.gov/refresh/graphics_at3+shtml/030345.shtml?gm_esurge
  25. 25. 22 Appendix A
FEMA Flood Hazard Mapping - HIGH
New York Locations: Lower Manhattan and 55 Water Street
New York Locations: 25 Broadway and 32 Ave of the Americas
New Jersey Locations: 410 Commerce Blvd. and 760 Washington Ave.
  26. 26. 23 FEMA Flood Hazard Mapping - HIGH (cont’d)
New Jersey Locations: 545 Washington Blvd. and 755 Secaucus Road
New Jersey Locations: 15 Enterprise Ave. North and 300 Boulevard East
  27. 27. 24 FEMA Flood Hazard Mapping - LOW
New York Locations: 111 8th Ave. and 360 Hamilton Ave., White Plains
New York Locations: 480 North Bedford Road, Chappaqua and 11 Skyline Drive, Hawthorne
  28. 28. 25 FEMA Flood Hazard Mapping - LOW (cont’d)
New Jersey Locations: 1400 Federal Blvd. and 3003 Woodbridge Ave.
New Jersey Locations: 165 Halsey Street and 100 Delawanna Ave.
  29. 29. 26 FEMA Flood Hazard Mapping - LOW (cont’d)
Chicago Locations: 350 East Cermak, Chicago, IL and 2905 Diehl Road, Aurora, IL
  30. 30. 27 Appendix B
Natural Disaster Risk Profiles for Data Centers

Type: Tornado
On-Site: In or near the storm path, expect disruption and minor to severe infrastructure damage
Off-Site: In or near the storm path, expect disruption and minor to severe infrastructure damage
Impact:
• Advanced warning of tornado potential but no site-specific warning
• Employees remain at site
• Duration is brief although intense
• Roof and outside equipment (cooling towers, etc.) damaged or destroyed
• Potential damage to the building structure
• Loss of local utility and communications

Type: Hurricane
On-Site: In or near the storm path, expect disruption and minor to severe infrastructure damage
Off-Site: Expect severe region-wide damage to public infrastructure, utilities and communications
Impact:
• Significant advanced warning
• Duration is hours to a few days
• Employees may require evacuation from site
• Post-storm security may be required
• Emergency supplies needed for at least several days
• Roof and outside equipment (cooling towers, etc.) damaged or destroyed
• Potential damage to the building structure
• Loss of local utility and communications
• Repair to regional damage may require days, weeks or longer for massive reconstruction of electric power transmission or distribution facilities
• Potential for off-site public infrastructure damage

Type: Earthquake
On-Site: Expect catastrophic damage and disruption to data centers near the epicenter and infrastructure damage to data centers further away
Off-Site: Expect severe region-wide damage to public infrastructure, utilities and communications
Impact:
• No warning
• Brief duration with the threat of continued aftershocks
• Employees may be unable to leave site
• Emergency supplies needed for several days of operation
• Building structural damage
• Toppling of unbraced computer hardware and site infrastructure equipment, including collapse of raised floor
• Site may be isolated for an extended period
• Highways and bridges may be damaged or destroyed, preventing movement of diesel fuel and other operating supplies required for continued operation
• Power and communications may sustain extensive damage requiring days, weeks or longer to repair

Source: Uptime Institute, Natural Disaster Risk Profiles for Data Centers, http://uptimeinstitute.com/publications
  31. 31. 28 Natural Disaster Risk Profiles for Data Centers (cont’d)

Type: Ice Storm / Blizzard
On-Site: Expect some disruption or failure of data center if outside equipment is not designed to survive severe ice and snow accumulation
Off-Site: Expect severe region-wide damage to public infrastructure, utilities and communications
Impact:
• Several days’ warning generally expected
• Storm or multiple storms may last several days with cumulative effects
• Employees may be unable to leave or enter site
• Emergency supplies needed for at least several days
• Ice damage to structure and outside equipment
• Roof failure from excessive snow load
• Potential freezing of pipes
• Loss of overhead power and/or communications lines over large areas may require several days, weeks or longer to repair
• Roads dangerous or impassable

Type: Thunderstorm / Lightning
On-Site: Expect disruption ranging from disaster to no impact depending on distance to lightning strike and proper operation of surge suppression, UPS, and engine-generator systems
Off-Site: Expect frequent momentary public utility disruptions from lightning strikes hitting the electric power transmission grid
Impact:
• Special sensors can provide minutes of storm approach warning
• Duration is brief but may recur daily during thunderstorm season
• Frequent UPS battery discharges shorten remaining battery life
• Extended power interruption if utility service is overhead or radial and a nearby lightning strike causes protective devices to open
• Possible flooding and roof leakage
• Momentary undervoltages can affect hundreds of square miles
• Fires started by lightning can destroy public infrastructure located in rural areas

Type: Flood
On-Site: Expect catastrophic damage and disruption to data centers in severe flood areas or with infrastructure systems below grade
Off-Site: Expect severe region-wide damage to public infrastructure, utilities and communications
Impact:
• Several days’ warning generally expected
• Employees may be unable to leave site
• Emergency supplies needed for at least several days of operation
• Site infrastructure damage requiring days to weeks to repair
• Site may be isolated for an extended period
• Highways and bridges may be damaged, preventing movement of diesel fuel and other operating supplies required for continued operation
• Power and communications may sustain extensive damage requiring days, weeks or longer to repair

Source: Uptime Institute, Natural Disaster Risk Profiles for Data Centers, http://uptimeinstitute.com/publications
  32. 32. 29 Appendix C
East Coast Liquidity Venues
The New York metro area is responsible for approximately 94% of the volume of shares traded for the US cash equity market¹⁴. The following maps illustrate the major liquidity venues in the New York and Chicago metropolitan locations.
Figure: maps of major liquidity venues (panels: New York, Chicago)
¹⁴ http://www.batstrading.com/market_data/daily_volume/
  33. 33. 30 Works Cited & References
Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System, September 2005. http://www.sec.gov/news/studies/34-47638.htm
Dodd-Frank H.R. 4173 Wall Street Reform and Consumer Protection Act, January 2010
National Public Radio (NPR), Visualizing The U.S. Electric Grid, April 24, 2009, www.npr.org/templates/story/story.php?storyId=110997398
NOAA National Climatic Data Center, State of the Climate: National Overview for Annual 2012, published online December 2012 from http://www.ncdc.noaa.gov/sotc/national/2012/13.
Uptime Institute, Natural Disaster Risk Profiles for Data Centers, http://uptimeinstitute.com/publications
United States Geological Survey, http://earthquake.usgs.gov/earthquakes/states/new_york/hazards.php
BICSI Standards for Data Centers, https://www.bicsi.org/default.aspx
Colocation Selection, best practices and critical considerations for choosing the right data center colocation solution. Bill Kleyman, Cloud and Virtualization Architect, October 2012
Climate Change and Infrastructure, Urban Systems, and Vulnerabilities, Technical Report for the U.S. Department of Energy in Support of the National Climate Assessment, February 29, 2012
The historic nor’easter of 13-14 March 2010, Richard H. Grumm, National Weather Service
  34. 34. 31 About Citihub
Founded in 1998, Citihub provides IT expertise to some of the world’s leading enterprise organizations and is comprised of industry veterans who relish the challenge of complex technology and cultural change. We take a fresh approach to the technical challenges of today and believe in partnering with our clients through change. Citihub clients include Investment Banks, Hedge Funds, Media, and Manufacturing.

About the Authors
Vincent Pelly
Vincent Pelly is an Associate Partner at Citihub with more than 30 years of experience across the financial services industry with specialization in infrastructure, program management and IT strategy. He has extensive experience managing large enterprise projects in infrastructure and data center advisory and technology implementation, and has managed large infrastructure transformation programs.

Scott Haglund
Scott Haglund is an independent consultant with more than 30 years of experience in the development and execution of global infrastructure strategy, architecture, transformation, technology roadmaps, optimization, and service delivery standards for the enterprise. He specializes in data center automation strategies, and has led many enterprise infrastructure transformation programs.
