Utilising the Cloud for Disaster Recovery
Craig Scott – Head of ICT Services, South Tyneside College
Supported by AoC
Introduction

Disaster Recovery is something that IT managers spend a considerable amount of time planning and preparing for, in the hope that they will never have to implement those plans. Over the years users have come to expect IT to be “always on” and available 24/7 to allow them to study or carry out the duties associated with their job role. These availability and reliability expectations also affect disaster recovery provision: it is no longer sufficient to rely on restoration from backup; instead, redundant hardware and facilities are required. This paper discusses factors that must be considered when planning for disaster recovery and identifies how cloud services can be used as a disaster recovery solution.

Determining Project Scope

Disaster Recovery – what is it?

The most important starting point for the project is to define what you mean by “Disaster Recovery”. To you and your team, is a disaster the failure of a single server? A fire in your data centre? A power outage to your entire site? Or all of the above?

Until you know what you’re trying to protect yourself from, it’s difficult to ensure that you have adequate processes and procedures in place. A risk based approach can help you to identify potential disasters, the impact they will have on your services and the likelihood of their occurrence.

Disaster Recovery vs. High Availability

High Availability (HA) is typically used to describe systems which are connected by high speed, low latency links and often have shared components. Many vendors provide failover clustering technologies that deliver high availability, such as Microsoft Windows Failover Clustering, Oracle Real Application Clusters etc…

HA solutions are designed to minimise the downtime of business critical services and can protect against hardware failure of specific components. HA clusters generally offer automated failover with minimal data loss. Typically the constituent parts of a failover cluster are located in the same data centre, or are all located on the same LAN (e.g. multiple data centres within the same building/campus).

[Figure: High Availability]
As a general rule Disaster Recovery (DR) refers to the provision of offsite facilities that are geographically separate from the primary facilities. A consequence of the geographic separation is the introduction of higher latency links. The high latency and potential unreliability of these links make them unsuitable for use by many clustering technologies.

The lines between HA and DR are blurred by some newer technologies which can provide the levels of failover and reliability typically associated with HA over WAN links, Microsoft Exchange Database Availability Groups being a typical example.

Defence in Depth

HA and DR are not mutually exclusive options and can be combined to further reduce the risk of service outage.

[Figure: Disaster Recovery]
Objectives

The success of any project is dependent upon clearly defined and understood objectives, without which it is impossible to measure the success or effectiveness of the project. The exact objectives will vary from project to project but at a minimum you should consider:

Physical Separation

Based on your risk assessment of the potential disasters, what is the minimum level of physical separation you require between your live and DR systems? Options to consider include:

• Different building
• Different campus
• Different town/city
• Different area of the country
• Different country
• Different continent

Acceptable Downtime

The initial reaction from many IT managers and business managers is that no downtime is acceptable. However, if the building containing your primary data centre and finance department burns to the ground it will take time for the finance team to be relocated to different premises, it will take time to find computers for them to use etc… So how quickly do you really need to restore access to your finance system?

Acceptable Data Loss Window

Whilst zero data loss is certainly desirable, as the level of synchronicity between live and DR systems increases so do the costs, either in terms of the technology required or the bandwidth used to maintain synchronicity.

Databases which handle real time transactions, such as on-line or face-to-face enrolments, normally require a small data loss window; ideally the window should be no more than a handful of transactions. If you lose a day of transactions can you recreate that data? Does the person who enrolled via your website know you have lost their data? Do you even know who they are?

For other systems a larger window may be acceptable. What would be the impact of losing the last 3-4 hours of data from your file servers? Is this any different from someone forgetting to press save and losing a file?
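
One way to make the acceptable data loss window measurable is to compare the age of each DR copy against the window agreed for that type of system. The following is a minimal sketch only: the role names, the function name and the window values (30 minutes for databases, 4 hours for data storage) are illustrative assumptions and should be replaced with the figures from your own risk assessment.

```python
from datetime import datetime, timedelta

# Illustrative windows per server role (assumed values, not recommendations).
ACCEPTABLE_DATA_LOSS = {
    "database": timedelta(minutes=30),
    "data_storage": timedelta(hours=4),
}


def replica_within_window(role, last_replicated, now):
    """Return True if the DR copy is no older than the window for its role."""
    return now - last_replicated <= ACCEPTABLE_DATA_LOSS[role]


if __name__ == "__main__":
    now = datetime(2013, 1, 15, 12, 0)
    # Database replicated 15 minutes ago: inside the 30 minute window.
    print(replica_within_window("database", datetime(2013, 1, 15, 11, 45), now))
    # File server replicated 6 hours ago: outside the 4 hour window.
    print(replica_within_window("data_storage", datetime(2013, 1, 15, 6, 0), now))
```

A check of this kind could be run on a schedule so that a stalled replication job is noticed before a disaster rather than during one.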
Capacity/Performance

What sort of capacity and performance is acceptable for your DR services? Thought needs to be given as to whether your DR services need to give your users the same level of performance as your live systems. Your DR system may introduce new bottlenecks, such as the available WAN/internet bandwidth between DR facilities and users. The amount of expansion capacity and historical capacity also needs to be considered.

Acceptable Restoration Time

If you have had to activate your DR services, at some point you’ll want to switch back to your live services. How will you do this? Will the failback result in any downtime?

The answers to many of the questions you will need to ask yourself will vary from system to system.

The Cloud Options

Maintaining DR facilities can be expensive, both in terms of investment in hardware (hardware which you hope you will never need to use) and the time needed to maintain and administer it. Use of the cloud to host your DR facilities can eliminate or reduce a number of these costs.

Most major cloud providers have globally dispersed redundant data centres which will generally be hundreds of miles away from your facilities.

• Infrastructure as a Service (IaaS)
  Selection of an IaaS option will remove the need to invest in hardware and construct a secondary server room/data centre. An IaaS DR solution involves renting sufficient computing resources from a cloud provider to allow you to create a “virtual data centre” in the cloud. You are then responsible for creating and maintaining the virtual machines which provide your DR facilities.

• Platform as a Service (PaaS)
  With PaaS the cloud provider is responsible for the hardware, operating systems and services. This removes the need for you to maintain and patch virtual machines. An example of PaaS is the Microsoft Azure SQL Database service: Microsoft are responsible for the hardware, operating systems and SQL Server installation, so you only need be concerned about your database.

In some cases you may be forced down an IaaS route by the need to install 3rd party software on a server; in other cases PaaS may be appropriate. For example, you may need to use IaaS for your finance system DR as you need to install a 3rd party finance server product, but you can use PaaS to provide DR for your website.

Alternatives to Disaster Recovery - Software as a Service (SaaS)

When looking at the services for which you need to provide DR facilities it is worth asking whether there is a better way to deliver those services. By moving services such as e-mail from traditional on-premises hosting to cloud hosting you obviate the need to invest time and money in providing DR facilities for those services; their availability and accessibility become the cloud provider’s concern.
Selecting a Cloud Provider

Platform

The cloud is a rapidly expanding growth area within the IT sector, both in terms of services offered and companies providing those services. Some providers have invested in the development of proprietary platforms, such as Amazon EC2 or Windows Azure, whilst other providers have developed services based on “off the shelf” products, such as VMware.

Compatibility

Compatibility between your cloud provider’s platform and your on-premises virtualisation platform can affect the options available for your data replication strategy. If the two platforms are compatible, or can be managed by the same virtualisation management platform such as Microsoft System Center Virtual Machine Manager, you may be able to move or replicate data and virtual machines between your on-premises solution and your cloud solution.

Compliance

The requirements of the Data Protection Act (1998) are often cited as a barrier to the use of the cloud, in particular the need to obtain subject consent prior to transferring data outside of the EU. You should not assume that because a cloud provider is based in the UK, or Europe, your data will be stored within the EU.

Most major cloud providers have data centres located within the EU and some allow you to select the “region” or even the individual data centre that will be used to store your data.

Security

Physical

Reputable cloud service providers should be able to provide information on the security accreditations with which their services and data centres comply. Many providers will be delivering services to customers in the financial, health care and defence sectors as well as to local and national governments, and as such will already comply with extremely stringent security requirements.

Connectivity

For your data to reach the data centres of your chosen cloud provider it will probably need to travel across the public internet. It is important to ensure that the data is protected in transit.

Most SaaS and PaaS solutions have been developed from the ground up as internet services and will make use of encrypted protocols to provide secure connectivity, for example HTTPS (HTTP over SSL) to connect to a web based SaaS e-mail solution or SFTP to transfer files to a PaaS hosted website.
IaaS services typically require Virtual Private Networks (VPNs) to connect the hosted virtual machines to your on-premises LAN. Site-to-site VPNs require a device at both sites to “terminate” the connection, so it is important to confirm that you have a suitable end point device capable of handling your end of the connection and that the device will work with your cloud provider’s VPN implementation.

Pricing Model and Contract Offerings

Is it necessary for all of your DR assets to be operational 24x7, or do you simply need them ready and waiting to be fired up?

Most cloud providers’ pricing is based on the size, allocated storage and hours of usage of a virtual machine. Applications which are built around an n-tier model will have application servers that host websites or application software. You may only need to fire up the virtual machines hosting these application server roles for a few hours a month for testing and patching. Does your cloud provider’s pricing structure reflect this usage model?

Understanding Risk

An analysis of the roles and workloads of your systems will help you to identify the level of risk that the loss of a system poses and therefore the level of DR protection and effort that it warrants.

Systems often comprise multiple servers, each fulfilling a distinct role. The impact of loss, and ease of restoration, will vary depending upon the role of the server.
Suggested roles are listed below:

| Role | Description | Data Change Frequency | Ease of Recreating Data | Acceptable Data Loss | Examples |
|------|-------------|-----------------------|-------------------------|----------------------|----------|
| Data Storage | Servers holding non-transactional data | High | Moderate | Moderate, 4 hours | File servers, mailbox servers etc… where users can recreate documents |
| Database | Database servers | High | Low | Low, 30 minutes | SQL Server, Oracle, MySQL etc… especially on-line systems where it may not be possible to recreate data (e.g. e-registers, on-line enrolment) |
| Application | Servers which do not store volatile data | Low | High | High | Web servers, middle-tier servers etc… static content updated infrequently (e.g. software upgrades, website redesign etc…) |

Data Replication Strategy

The data in each of your DR systems must be updated regularly and be no older than the acceptable data loss window you have identified for that system. It is important to select a replication method that is appropriate for the level of risk and the acceptable data loss window.

Approaches

Application Replication

Many enterprise class applications incorporate their own replication technologies, for example Microsoft Exchange Database Availability Groups, Oracle Data Guard, MySQL master/slave replication etc… Where application replication technologies are available they should be considered the preferred option, as they are designed to replicate data in a manner that makes sense to the application.

File System Level

In some cases simply copying files from the live systems to the DR systems will suffice to replicate the data.

Tools such as “robocopy” and “rsync” are able to determine what differences exist between source and destination locations, copy only new or changed files to the DR location and remove redundant files from the DR site. Services such as the “Distributed File System” (DFS) built into Windows Server can be used to automate and manage file replication.

It is important to check that a file system copy is appropriate for the type of data being replicated. Using file system replication to copy the data files of your SQL Server whilst it is running could result in data corruption.
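
To illustrate what a file system level mirror does, the sketch below copies new or changed files to the DR location and then removes files that no longer exist at the source. It is a simplified illustration and not how “robocopy” or “rsync” are implemented: those tools normally compare timestamps and sizes, whereas this sketch compares file contents for simplicity. The function name `mirror` and the idea of passing two local folder paths are assumptions made for the example.

```python
import filecmp
import os
import shutil


def mirror(source, destination):
    """One-way mirror of source into destination.

    Returns two sorted lists of relative paths: files copied and files removed.
    """
    copied, removed = [], []

    # Pass 1: copy files that are new or whose contents differ.
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        target_dir = os.path.normpath(os.path.join(destination, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_dir, name)
            if not os.path.exists(dst) or not filecmp.cmp(src, dst, shallow=False):
                shutil.copy2(src, dst)
                copied.append(os.path.normpath(os.path.join(rel, name)))

    # Pass 2: walk the destination bottom-up and remove anything
    # that no longer exists in the source.
    for root, dirs, files in os.walk(destination, topdown=False):
        rel = os.path.relpath(root, destination)
        for name in files:
            if not os.path.exists(os.path.join(source, rel, name)):
                os.remove(os.path.join(root, name))
                removed.append(os.path.normpath(os.path.join(rel, name)))
        for name in dirs:
            if not os.path.isdir(os.path.join(source, rel, name)):
                os.rmdir(os.path.join(root, name))  # already emptied above

    return sorted(copied), sorted(removed)
```

Running `mirror` a second time with no changes at the source should report nothing copied and nothing removed, which is a useful sanity check for any synchronisation job. As noted above, this style of copy is not safe for the data files of a running database.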
Virtual Machine Replication

Replication or cloning of entire virtual machines is also a strategy that should be considered. This is especially useful where all the components of a single system are located on one virtual machine. This approach should also be considered for application/middle-tier servers where significant time and effort has been expended customising or configuring the middle-tier components.

Best Approach

Complex systems often consist of multiple servers, each of which has a distinct role within that system. Consider a student records system: this will probably consist of a database server, two identical application servers and a client application. Your database will be experiencing constant changes and you need to ensure that in the event of a disaster you don’t lose any records. On the other hand, the software on the application servers is updated via a controlled process every 6 months when the software vendor releases an update. In this scenario it would be appropriate to use the database system’s inbuilt replication technology to protect your database and to use virtual machine replication to replicate one of the application servers. You might only replicate the virtual machine once a month, as it has a low degree of data volatility.

| Approach | Pros | Cons | Data Granularity | Recommended For |
|----------|------|------|------------------|-----------------|
| Application | Application aware; transaction rollback; corruption detection; automatic failover | Can be complicated to set up; requires two installations of application software; may require additional licences; may introduce additional overhead on live systems | Variable but appropriate for application (e.g. database transaction, Active Directory object, e-mail message etc…) | Databases; mailboxes; LDAP (inc Active Directory) |
| File System | Simple to set up | Excludes open files; requires scripts and/or additional software | File level | File shares |
| VM Replication | Replicates entire server | Can be complicated to set up; lots of data to transfer; servers may require reconfiguration once activated | Virtual machine (though some solutions allow block level) | Application servers; “1 server” systems |

Software Licences

Typically when you create a virtual machine in the cloud the machine will be based on a template which has a cost associated with it, usually charged hourly, weekly or monthly. In most cases these prices include the cost of the licence for the operating system used by the template.

The same usually applies to PaaS, in that the charge for the period will include the licence costs for all the components of that service. For example, you don’t need to purchase licences for Microsoft SQL Server to use Microsoft’s Azure SQL Database platform.
You do need to ensure that you are adequately licensed for any software you install on the virtual machines you create in the cloud. Consider a scenario where you create a virtual machine to host Microsoft Exchange Server because you want to use Exchange Database Availability Groups to provide application level replication for your e-mail system. In this scenario you probably wouldn’t need to purchase a licence for Windows Server (as this will be included in the cost you are paying for the virtual machine) but you will need to buy a licence for the copy of Microsoft Exchange you have installed on that server.

Some software vendors incorporate provision in their education and volume licensing schemes that allows you to install additional copies of their software for disaster recovery purposes.

You don’t want to spend money on licences you don’t need. If you are unsure about what you are or aren’t allowed to do with your existing licences, try checking the software vendor’s website for licensing FAQs, contacting the retailer from whom you purchased the software or contacting the vendor directly.

Considering Failover

If you have to activate your DR facilities, how will your users and client devices know where to find the systems they need to connect to?

Most modern networks make use of DNS to locate servers and services, although in some cases you may be using IP addresses to locate services. It is probable that your DR facilities will be on a different IP subnet from your live systems, and your clients need to be informed of this to allow them to connect to your DR facilities.

Active Directory and DNS

Assuming that you are utilising Microsoft Active Directory (AD), the servers on your DR site will need access to AD and its associated DNS in order to operate. It is therefore recommended that you maintain at least one operational Domain Controller in your DR facilities. This will also provide inherent DR for your AD and DNS infrastructures without any further work on your part.

IP Address Allocation

If you have chosen to replicate virtual machines to your DR site, do these virtual machines have static IP addresses assigned? If so you will need to log in to each VM as you bring it online and assign a new IP address. Consider whether you can use DHCP to assign IP addresses to your servers.

Application Aware Failover

If an application has some form of application level replication it may also have application level failover. Microsoft Exchange Database Availability Groups (DAGs) are one such example: with DAGs the Exchange client access servers automatically connect to the mailbox server which is hosting the active database.

Distributed File System (DFS)

Switching to an alternate file server normally involves finding all references to the UNC path of the failed file server and replacing them with references to the new file server.

DFS allows the creation of a fault tolerant file share containing folders that refer to one or more real file shares. By configuring an active and an inactive referral for each file share, one referencing your live system and the other your DR system, all you need do to fail over is change the referrals appropriately.
DNS for Failover

It is assumed that you have created a Domain Controller in your DR site that is also a DNS server, thus providing resilience for your DNS. Most of your clients will be using DNS to locate the servers and services to which they connect, so in many cases switching to your DR facilities may involve no more than changing DNS entries so they point at the DR system.

Consideration needs to be given to the TTL value of the DNS entries, as this determines the length of time your clients will cache the returned DNS data. If your records have a TTL of an hour it could take that long before some of your clients can access your DR services. You should ensure that the TTL values for the critical DNS records are consistent with your failover objectives.

When planning for DR it is recommended that you review the way your clients currently locate their servers. Where possible avoid the use of IP addresses or server names and use DNS alias (CNAME) records. For example, instead of using http://servername.college.ac.uk/ebs create a DNS CNAME for ebs-live.college.ac.uk which refers to servername.college.ac.uk. That way, if you have to switch to your DR system all you need do is update the CNAME record.

Replicated Virtual Machines

In most cases failover of replicated VMs will be as simple as powering on the VM, checking it has an appropriate IP address and ensuring that DNS reflects the current IP address.

Where the VM is part of a multi-tier application and you have also failed over database tier components, you may need to update the application with the new address of the database server. This process can be simplified through the use of DNS aliases and application specific redirects. For example, you might create a DNS alias for “studentrecords-live.college.ac.uk” which points at your live database server and then use this address when installing/configuring application-tier components. In the event of failure all you need to do is change where the DNS alias points.

Network Load Balancers

Network load balancers (NLB) provide an option for failover of some services; good quality load balancers will be able to detect server and application failure automatically and redirect traffic. However, you also need to consider DR for your NLB. If you position an NLB on your live site which is configured to redirect traffic to your DR site, what will you do if your NLB is out of action?

Planning

Once you’ve carried out your risk assessment you will have a better idea of the disasters that you may encounter and the probability of each. As you have hopefully realised, you are more likely to encounter situations where one, or a small number, of related systems have failed, probably as a result of a hardware failure or software problem. The level of detail in your DR plan should reflect how critical the system is and how quickly it needs to be recovered.

You may have generic processes that apply across multiple systems. For example, if you have multiple database servers with identical DR processes a single process is probably sufficient.

Whilst it is possible to create detailed scripts and automated procedures that can be used to activate DR facilities, every disaster tends to be different and needs to be assessed individually. The process to fix a disaster of type A may in fact make a disaster of type B worse.
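
The DNS alias approach to failover described above can be modelled in a few lines. This is a hedged sketch of the idea only: it uses an in-memory dictionary rather than a real DNS server, and the DR host names, the TTL values and the 15 minute failover objective are assumptions made for illustration. The alias and live server names are taken from the examples in the text.

```python
from dataclasses import dataclass


@dataclass
class CnameRecord:
    target: str        # host name the alias currently points at
    ttl_seconds: int   # how long clients may cache the answer


def records_exceeding_objective(zone, objective_seconds):
    """List aliases whose TTL is longer than the failover objective."""
    return sorted(
        alias for alias, record in zone.items()
        if record.ttl_seconds > objective_seconds
    )


def fail_over(zone, alias, dr_target):
    """Repoint an alias at the DR server, keeping its existing TTL."""
    zone[alias] = CnameRecord(dr_target, zone[alias].ttl_seconds)


if __name__ == "__main__":
    zone = {
        "ebs-live.college.ac.uk": CnameRecord("servername.college.ac.uk", 3600),
        "studentrecords-live.college.ac.uk": CnameRecord("dbserver.college.ac.uk", 300),
    }
    # With a 15 minute objective, the one hour TTL is flagged for review.
    print(records_exceeding_objective(zone, 15 * 60))
    # Failover is a single change to the alias; clients keep using the same name.
    fail_over(zone, "ebs-live.college.ac.uk", "dr-servername.college.ac.uk")
    print(zone["ebs-live.college.ac.uk"].target)
```

The point of the sketch is that clients only ever reference the alias, so the failover step touches one record per service, and that any record with a TTL longer than the failover objective will delay some clients by up to that TTL.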
The best approach is scenario based: start with the highest probability, highest impact risks and work down to those with the lowest probability and impact.

An important consideration in your planning is who has the authority to declare a “disaster” and invoke the DR plan. In some cases invoking the DR plan may result in more overall disruption than leaving a particular service offline for an hour while you fix it.

Testing

It is essential to test your DR processes regularly. The scope of testing needs to be considered on a system by system basis. Also consider whether you need to test every system: if you have 20 servers with an identical process, do you need to test them all regularly?

For systems with transparent application level replication and failover, testing should be straightforward and can be done regularly. In cases where a failover would be disruptive, is simulating failover sufficient for the system in question?

Example Implementation

Background

Until the summer of 2011 South Tyneside College (STC) operated across two major campuses (Westoe and Hebburn) and a third specialist campus (MSTC). STC’s primary data centre was located on the main campus (Westoe) with a smaller server room at Hebburn; the MSTC has only a single server. Systems had been established for some time to replicate data and services between Westoe and Hebburn, allowing either campus to act as DR site for the other.
For reasons of operational efficiency a decision was taken to close Hebburn. Due to the high cost of creating the necessary facilities and upgrading the data links it was not feasible to establish DR facilities for Westoe at the MSTC. A redundant server room in a separate building on the Westoe campus was refurbished for DR use.

Challenge

The primary data centre supports 46 physical servers and 69 virtual machines; a further 16 physical servers are located in the secondary server room providing support for DR. The hardware in the secondary server room had previously been the “live” hardware from Hebburn and was planned for replacement in summer 2013. The cost of replacing this equipment was estimated at £50,000 - £60,000. Examination of the available options indicated that the use of the cloud for our DR facilities would result in savings of around 10-15% and provide a truly offsite solution. The work involved would also allow us to gradually migrate a number of live services from on-premises to the cloud in future, producing further cost savings.
Planning

Numbers

Due to the levels of resilience and HA provided by the equipment in the primary data centre, we only expect to need to activate the DR facilities in the event of a disaster which renders our main campus unusable (fire, flood, prolonged power outage etc…). Under these circumstances we anticipate that the major performance bottleneck will be the available bandwidth of the internet connection(s) used to connect to the virtual data centre.

Based on this supposition the following criteria were applied to determine if a system or server was within the scope of the project.

• Where multiple load balanced application servers for the same service existed we would only provide one DR server
• Where we had split large workloads across non-load balanced servers (e.g. file servers) we would consolidate these workloads on one DR server
• Servers in the DMZ would be excluded where these services duplicated LAN servers which are in scope
• Servers which were used to support physical equipment which would likely be inaccessible during a disaster would be excluded from scope. This was on the grounds that if our buildings are out of action so will be the equipment they contain, therefore print servers, wi-fi controllers etc… would not be required.

Analysis of the roles and workloads of our servers indicated that our disaster recovery strategy needed to support a minimum of 29 servers.

Workloads

Of the 29 systems within project scope we identified 9 database servers and 3 data store servers (file server, mailbox and Active Directory). The remaining servers fit into the application server category.

Replication and Failover Strategies

Based on the workloads of the systems in scope, a combination of application level replication, file system level replication and virtual machine cloning was adopted. For a small number of cases it was recognised that the best option was to build a new application server in the cloud due to the comprehensive application level functionality provided by that system, for example Microsoft Exchange Client Access Servers.

Application Level Replication

Application level replication was selected for Active Directory (AD has inherent replication), Microsoft SQL Server, MySQL Server and Microsoft Exchange. All of these applications have built in multi-server replication mechanisms which allowed for recovery windows of less than 15 minutes.

Failover procedures for these systems are either automatic/inherent (Active Directory and Exchange) or require a flag to be set within the application to indicate the primary server (SQL Server and MySQL). In the case of SQL Server and MySQL Server it is also necessary to update the configurations of the application servers/client applications to reference the DR servers as opposed to the live servers.
  15. 15. 16 Utilising the Cloud for Disaster RecoveryFile Level ReplicationFile level replication was used to replicate data from the 4 on-premises file servers to the single cloudbased file server using the built in “robocopy” command and its mirroring/synchronization option. Thesynchronization was scheduled to run overnight as a one working day recovery window was deemedadequate for file services.As Microsoft DFS is used in all links and paths that reference the file shares on the file servers failoverinvolves disabling the referral to the on-premises servers and enabling the referral to the cloud servers.Virtual Machine Replication – Database ServersA small number of simple systems, some quite critical, have all their components installed on a singlevirtual machine. These applications either do not have a high workload, or do not have scalablearchitectures. Systems falling within this category include the payroll system, library managementsystem, active directory certificate services and an Oracle Express server used for teaching purposes. Forthese systems virtual machine level replication was selected with a nightly replication interval.Failover requires the virtual machines be brought on-line, they will automatically register their new IPaddress with DNS.Virtual Machine Replication – Application ServersThe remaining systems all fulfilled application front end/middle tier roles, therefore virtual machinereplication was selected as the replication strategy. As updates and changes to the live servers arecarried out via a controlled change management process a weekly virtual machine refresh was deemedsufficient.Failover requires the virtual machines be brought on-line, they will automatically register their new IPaddress with DNS. 
In some cases it is also necessary to update the database server references to refer to the DR database servers.

Cloud Provider Selection

Once the workloads, replication and failover strategies had been decided upon, a review of the services offered by various cloud service providers was undertaken.

As it was identified that 60% of the virtual machines required for the DR solution would only need to be powered up for testing and patching for a couple of hours each month, providers with an hourly pricing model were favoured.

Compatibility with existing systems was also a factor in provider selection. The virtualisation infrastructure at STC is based on Microsoft Hyper-V (Windows 2008 R2) managed by Microsoft System Centre Virtual Machine Manager (MSCVMM) 2012. Therefore solutions that offered management integration with MSCVMM and virtual machine migration from Hyper-V were favoured.

Consideration of the above factors, plus pricing, resulted in the selection of the Microsoft Windows Azure platform; Microsoft were able to offer favourable educational pricing. However, as we were the first UK institution to sign up to Azure via an education agreement, we discovered that Microsoft's sign-up procedures were not fully developed, which resulted in delays of many months. It should be noted that we have been assured by Microsoft that these procedures are now fully developed and have been used successfully by other institutions.
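The preference for hourly pricing can be illustrated with a rough calculation. The Python sketch below is not taken from the STC project: the VM count, the price per hour and the 730-hour month are hypothetical figures chosen purely for illustration. Only the 60% share of rarely powered machines and the "couple of hours each month" come from the text above.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def monthly_vm_cost(vm_count, hours_per_vm, price_per_hour):
    """Cost of running vm_count machines for hours_per_vm hours each."""
    return vm_count * hours_per_vm * price_per_hour


def estimate(total_vms, standby_share, standby_hours, price_per_hour):
    """Compare hourly billing with an always-on billing model.

    standby_share is the fraction of VMs that are only powered up for
    testing and patching; the remainder are assumed to run all month.
    Returns a tuple (hourly_billing_cost, always_on_cost).
    """
    standby_vms = round(total_vms * standby_share)
    always_on_vms = total_vms - standby_vms
    hourly = (
        monthly_vm_cost(standby_vms, standby_hours, price_per_hour)
        + monthly_vm_cost(always_on_vms, HOURS_PER_MONTH, price_per_hour)
    )
    always_on = monthly_vm_cost(total_vms, HOURS_PER_MONTH, price_per_hour)
    return hourly, always_on


if __name__ == "__main__":
    # Hypothetical inputs: 20 VMs at 0.10 (any currency) per hour.
    hourly, always_on = estimate(
        total_vms=20, standby_share=0.6, standby_hours=2, price_per_hour=0.10
    )
    print(f"hourly billing:    {hourly:.2f} per month")
    print(f"always-on billing: {always_on:.2f} per month")
```

Under an hourly model the standby machines contribute almost nothing to the monthly bill, so the total is dominated by the minority of servers that must run continuously. Storage, bandwidth and VPN charges are deliberately left out of this sketch.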
Implementation Process

Implementation of the solution was approached via the following sequence:

1. Establish VPN connectivity
   STC uses a pair of Smoothwall UTM-3000 appliances to provide internet content filtering and firewall services. The Smoothwall UTM-3000 supports IPSec site-to-site VPNs, as does Windows Azure. Establishing a site-to-site VPN between the two systems was relatively straightforward.

2. Build and commission Domain Controller in Azure
   The first server created in Azure was a Domain Controller to provide Active Directory and DNS services to our other servers. This was accomplished by installing Windows 2008 R2 on a new virtual machine, which we then promoted to a Domain Controller before installing the DNS server role.

3. Build database, mailbox and file servers
   Servers were built to host these roles and the appropriate application software was installed (i.e. Microsoft Exchange, Microsoft SQL Server etc.).

4. Establish Replication
   Application level replication was established for:
   • Exchange: the DR server was added to the Exchange Database Availability Group (DAG) and each existing mailbox database within the DAG had a new replication target added.
   • SQL Server: database log shipping was selected as the most appropriate replication method, and new log shipping partnerships were created using the wizards built into SQL Server Management Studio.
   • File Servers: initial replication of file data was accomplished via the "robocopy" command line tool; subsequent replication runs made use of the "/mir" switch to synchronize the data on the replica servers.

5. Establish virtual machine replication
   Virtual machine replication was initially achieved by copying backups of the VHD files of live virtual machines to Azure using the "csupload" command line tool. However, work is on-going to use System Centre App Controller and System Centre Orchestrator to accomplish these tasks in future.
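The file server synchronization in step 4 could be scheduled through a small wrapper script. The following Python sketch is an illustration only and is not the script used at STC: the share paths, the retry and wait values and the log file name are hypothetical. It relies on the documented "/MIR", "/R:", "/W:" and "/LOG:" switches of robocopy and on the fact that robocopy exit codes of 8 or above indicate that at least one copy failed.

```python
import subprocess


def build_robocopy_command(source, destination, log_file=None):
    """Build a robocopy command line that mirrors source to destination."""
    command = [
        "robocopy", source, destination,
        "/MIR",  # mirror the tree: copy changes, purge files deleted at source
        "/R:2",  # assumption: retry a failed copy twice
        "/W:5",  # assumption: wait 5 seconds between retries
    ]
    if log_file:
        command.append(f"/LOG:{log_file}")
    return command


def robocopy_succeeded(exit_code):
    """Robocopy exit codes are bit flags; 8 or above means a failure occurred."""
    return 0 <= exit_code < 8


def mirror(source, destination, log_file=None):
    """Run one synchronization pass (Windows only) and report success."""
    result = subprocess.run(build_robocopy_command(source, destination, log_file))
    return robocopy_succeeded(result.returncode)


if __name__ == "__main__":
    # Hypothetical UNC paths for one on-premises share and its cloud replica.
    ok = mirror(r"\\fs01\staff", r"\\dr-fs01\staff", log_file=r"C:\logs\staff.log")
    print("synchronization succeeded" if ok else "synchronization reported failures")
```

Because "/MIR" also deletes files from the destination that no longer exist at the source, a wrapper like this should be tested against a non-critical share before being scheduled.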
Future Developments

Partially as a result of our experiences with this project, it is the intention of STC to make significantly more use of cloud computing services. In some cases we have identified that increased adoption of cloud services may in fact increase costs but offers us significantly better functionality.

• Office 365
   A project is underway to migrate all staff and student e-mail content, 500GB of SharePoint content, and the contents of staff and student "My Documents" folders (approximately 1TB of files) to Office 365.

• Hyper-V Replica
   Windows Server 2012 introduced the ability to have active/passive replicas of individual virtual machines. An Azure implementation of this technology is in development which allows Azure to participate as one side of this partnership. Once available, this solution will be used to accomplish VM replication to Azure.

• Server Migration
   Work carried out to date has proven that it is feasible and practical for us to host servers in Windows Azure. Over the next 3 years an increasing proportion of our server infrastructure will be moved from on-premises hardware to Azure. The migration to Office 365 is the first step of this process as it eliminates the need for on-premises e-mail, file storage and SharePoint servers.
Appendix 1

Provider comparison table (column headings only; no rows are populated):

• Provider
• OS
• Virtual Machines
   Small: CPU, RAM, HDD, Price/hr
   Medium: CPU, RAM, HDD, Price/hr
   Large: CPU, RAM, HDD, Price/hr
• Storage
   Space: Price/GB per month
   IOPS: Price/million per month
• Bandwidth
   In: Price/GB per month
   Out: Price/GB per month
• VPN: Price per hour
• Requirements
   Small VMs: No. hours per VM per month
   Medium VMs: No. hours per VM per month
   Large VMs: No. hours per VM per month
   Storage: GB per month, IOPS per month
   Bandwidth: In GB per month, Out GB per month
   VPN: Hours per month
• Est Cost Per Month
• Annual Cost
Appendix 2

Risk assessment table (column headings):

• Disaster
• Scope: College, Campus, Building, Service
• Impact Assessment: Downtime, Likelihood, Impact, Score
• Controls
• Residual Risk: Downtime, Likelihood, Impact, Score

The template contains eight otherwise blank rows, each showing only two 0 values (apparently the two Score columns).
Appendix 3

Server inventory table (column headings only; no rows are populated):

• Server
• System
• Workload
• Scope
• Why Not in Scope?
• Replication Strategy
• Recovery Window
• State
• Failover
• Size
• CPU
• RAM
• Storage
• Operating System
Appendix 4

Failover table (columns: System, Downtime Trigger, Failover Authorisation, Sequence, Role, Action):

• Student Records (Downtime Trigger: 30 minutes; Failover Authorisation: IT Manager)
   1. Database: Active standby mirror
   2. Application: Update HKLM\Software\Adatum\StudentRecords\DatabaseServer
   3. Clients: Advise users to reboot

• Finance System (Downtime Trigger: 4 hours; Failover Authorisation: IT Manager)
   1. Database: Active standby mirror
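A failover table of this kind can also be kept in machine-readable form so that the steps for a system are always presented in sequence order. The Python sketch below is a hypothetical illustration, not part of the STC solution: the FailoverStep type and the steps_for helper are invented names, and the registry path is written with the backslash separators it is assumed to have had in the original table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverStep:
    system: str
    sequence: int
    role: str
    action: str


# Rows transcribed from the Appendix 4 table, deliberately stored out of order.
RUNBOOK = [
    FailoverStep("Student Records", 3, "Clients", "Advise users to reboot"),
    FailoverStep("Student Records", 1, "Database", "Active standby mirror"),
    FailoverStep(
        "Student Records", 2, "Application",
        r"Update HKLM\Software\Adatum\StudentRecords\DatabaseServer",
    ),
    FailoverStep("Finance System", 1, "Database", "Active standby mirror"),
]


def steps_for(system, runbook=RUNBOOK):
    """Return the failover steps for one system in execution order."""
    return sorted(
        (step for step in runbook if step.system == system),
        key=lambda step: step.sequence,
    )


if __name__ == "__main__":
    for step in steps_for("Student Records"):
        print(f"{step.sequence}. {step.role}: {step.action}")
```

Keeping the sequence number as data, rather than relying on the order of rows, means a step added later cannot silently end up in the wrong place.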
Association of Colleges
2-5 Stedham Place
London
WC1A 1HU
Telephone: 020 7034 9900
Facsimile: 020 7034 9950
Email: sharedservices@aoc.co.uk
Or visit our web site www.aoc.co.uk
