Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 1 of 22
Using Amazon Web Services for Disaster Recovery
October 2014
Glen Robinson, Attila Narin, and Chris Elleman
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 2 of 22
Contents
Introduction...............................................................................................................................................................3
Recovery Time Objective and Recovery Point Objective ................................................................................................4
Traditional DR Investment Practices ............................................................................................................................4
AWS Services and Features Essential for Disaster Recovery...........................................................................................5
Example Disaster Recovery Scenarioswith AWS...........................................................................................................9
Backup and Restore ................................................................................................................................................9
Pilot Light for Quick Recovery into AWS .................................................................................................................11
Warm Standby Solution in AWS.............................................................................................................................14
Multi-Site Solution Deployed on AWS and On-Site..................................................................................................16
AWS Production to an AWS DR Solution Using Multiple AWS Regions......................................................................18
Replication of Data...................................................................................................................................................18
Failing Back from a Disaster.......................................................................................................................................19
Improving Your DR Plan ............................................................................................................................................20
Software Licensing and DR........................................................................................................................................21
Conclusion ...............................................................................................................................................................21
Further Reading........................................................................................................................................................22
Document Revisions.................................................................................................................................................22
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 3 of 22
Abstract
In the eventof a disaster,youcan quicklylaunchresourcesinAmazonWebServices(AWS) toensure business
continuity.Thiswhitepaperhighlights AWSservicesandfeatures thatyoucan leverage foryour disasterrecovery (DR)
processes tosignificantlyminimize the impactonyourdata,your system, andyouroverall businessoperations.The
whitepaperalso includes scenarios thatshow you,step-by-step,how toimprove yourDRplan andleverage the full
potential of the AWScloudfordisasterrecovery.
Introduction
Disasterrecovery (DR) isaboutpreparingforand recoveringfromadisaster.Anyeventthathasa negative impacton a
company’s businesscontinuityorfinances couldbe termedadisaster.Thisincludes hardware orsoftware failure, a
networkoutage, apoweroutage,physical damage toa building like fire orflooding,humanerror,orsome other
significantevent.
To minimize the impactof adisaster, companiesinvesttime andresourcestoplan andprepare, totrainemployees,and
to documentandupdate processes. The amountof investmentforDRplanning fora particularsystemcanvary
dramaticallydependingonthe cost of a potential outage. Companiesthathave traditionalphysical environments
typically mustduplicate theirinfrastructure toensure the availabilityof spare capacityin the eventof adisaster.The
infrastructure needstobe procured,installed,andmaintainedsothatitis ready to supportthe anticipatedcapacity
requirements. Duringnormal operations,the infrastructure typically isunder-utilizedorover-provisioned.
WithAmazonWeb Services(AWS),yourcompanycanscale up itsinfrastructure onan as-needed,pay-as-you-gobasis.
You getaccess to the same highlysecure, reliable, andfastinfrastructure thatAmazonusestorunits ownglobal
networkof websites.AWSalsogivesyou the flexibility toquickly change andoptimize resourcesduringaDR event,
whichcan resultinsignificantcostsavings.
Thiswhitepaperoutlinesbestpracticestoimprove yourDRprocesses,fromminimal investmentstofull-scaleavailability
and faulttolerance,andshowsyouhowyoucan use AWS servicestoreduce costand ensure businesscontinuityduring
a DR event.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 4 of 22
Recovery Time Objective and Recovery Point Objective
Thiswhitepaperusestwocommonindustrytermsfordisasterplanning:
Recoverytime objective (RTO)1
— The time it takesaftera disruptiontorestore abusinessprocesstoitsservice level,as
definedbythe operational levelagreement(OLA).Forexample,if adisasteroccursat 12:00 PM(noon) andthe RTOis
eighthours,the DR processshouldrestore the businessprocesstothe acceptable servicelevel by8:00 PM.
Recoverypoint objective (RPO)2
— The acceptable amountof data lossmeasuredintime.Forexample, if adisaster
occurs at 12:00 PM (noon) and the RPOis one hour, the systemshould recoverall datathat wasin the systembefore
11:00 AM. Data losswill spanonlyone hour,between11:00 AMand 12:00 PM(noon).
A company typicallydecidesonanacceptable RTOand RPObasedon the financial impacttothe businesswhensystems
are unavailable.The companydetermines financial impactbyconsideringmanyfactors,suchasthe lossof businessand
damage to itsreputationdue todowntime andthe lackof systemsavailability.
IT organizationsthenplansolutionstoprovide cost-effectivesystemrecoverybasedonthe RPOwithinthe timeline and
the service levelestablishedbythe RTO.
Traditional DR Investment Practices
A traditional approachtoDR involvesdifferentlevelsof off-site duplicationof dataand infrastructure.Critical business
servicesare setupand maintainedonthisinfrastructure andtestedatregularintervals. The disasterrecovery
environment’slocationandthe source infrastructure shouldbe asignificantphysical distance aparttoensure thatthe
disasterrecoveryenvironmentisisolatedfromfaultsthatcouldimpactthe source site.
At a minimum,the infrastructurethatisrequired tosupportthe duplicate environment should include the following:
 Facilitiestohouse the infrastructure,includingpowerandcooling.
 Securitytoensure the physical protectionof assets.
 Suitable capacitytoscale the environment.
 Supportfor repairing,replacing,andrefreshing the infrastructure.
 Contractual agreementswithanInternet service provider(ISP)toprovide Internetconnectivitythatcansustain
bandwidthutilizationforthe environmentunder afull load.
 Networkinfrastructure suchasfirewalls,routers,switches,andloadbalancers.
 Enoughservercapacityto run all mission-critical services,includingstorage appliancesforthe supportingdata,
and serverstorun applicationsandbackendservicessuchasuserauthentication, DomainNameSystem(DNS),
DynamicHost ConfigurationProtocol (DHCP),monitoring,andalerting.
1 From http://en.wikipedia.org/wiki/Recovery_time_objective
2 From http://en.wikipedia.org/wiki/Recovery_point_objective
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 5 of 22
AWS Services and Features Essential for Disaster Recovery
Before we discussthe variousapproachestoDR,it isimportantto review the AWSservicesandfeaturesthatare the
mostrelevanttodisasterrecovery.Thissection providesasummary.
In the preparationphase of DR,it is importantto considerthe use of servicesandfeaturesthatsupportdatamigration
and durable storage, because theyenable youtorestore backed-up,critical datatoAWS whendisasterstrikes.For some
of the scenariosthatinvolve eitherascaled-downora fully scaleddeploymentof yoursysteminAWS,compute
resourceswill be requiredaswell.
Whenreactingto a disaster,itisessential toeitherquicklycommissioncompute resourcestorunyour systeminAWSor
to orchestrate the failovertoalreadyrunningresourcesinAWS.The essentialinfrastructure piecesinclude DNS,
networkingfeatures,andvarious AmazonElasticComputeCloud (AmazonEC2) featuresdescribed laterinthissection.
Regions
AmazonWebServicesare available inmultiple regionsaroundthe globe,soyoucanchoose the most appropriate
locationforyour DR site,inadditiontothe site where yoursystemisfullydeployed.AWShasmultiplegeneralpurpose
regions inthe Americas,EMEA,and AsiaPacificthatanyone withan AWSaccount can access. Special-use regionsare
alsoavailable forgovernmentagenciesandforChina.See the full listof available regions here.
Storage
AmazonSimple Storage Service(AmazonS3) providesahighlydurable storage infrastructure designedformission-
critical and primarydata storage.Objectsare redundantlystoredonmultipledevicesacrossmultiplefacilitieswithina
region,designedto provide adurabilityof 99.999999999% (11 9s). AWSprovidesfurtherprotectionfordataretention
and archivingthroughversioninginAmazonS3,AWS multi-factorauthentication(AWSMFA),bucketpolicies,andAWS
IdentityandAccessManagement(IAM).
AmazonGlacierprovidesextremelylow-coststorage fordataarchivingand backup.Objects(or archives,astheyare
knowninAmazonGlacier) are optimizedforinfrequentaccess,forwhichretrievaltimesof several hoursare adequate.
AmazonGlacierisdesignedforthe same durabilityasAmazonS3.
AmazonElasticBlockStore (AmazonEBS) providesthe abilitytocreate point-in-time snapshotsof datavolumes. Youcan
use the snapshotsas the startingpointfor newAmazonEBS volumes,andyoucan protectyour data forlong-term
durabilitybecause snapshotsare storedwithinAmazonS3.Afteravolume iscreated, youcan attach itto a running
AmazonEC2 instance.AmazonEBSvolumesprovide off-instancestorage thatpersistsindependentlyfromthe life of an
instance and isreplicatedacrossmultipleserversinanAvailabilityZone topreventthe lossof data from the failure of
any single component.
AWS Import/Export acceleratesmovinglarge amountsof dataintoand out of AWS by usingportable storage devicesfor
transport.AWS Import/Exportbypassesthe Internetand transfersyourdatadirectlyontoand off of storage devicesby
meansof the high-speedinternal network of Amazon.Fordatasetsof significantsize,AWSImport/Exportisoftenfaster
than Internettransferandmore costeffective thanupgradingyourconnectivity.Youcanuse AWS Import/Exportto
migrate data intoandout of AmazonS3 bucketsandAmazonGlacier vaultsor intoAmazonEBS snapshots.
AWS Storage Gateway isa service thatconnectsan on-premisessoftware appliance withcloud-basedstorage toprovide
seamlessand highly secure integrationbetween youron-premisesITenvironmentand the storage infrastructure of
AWS.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 6 of 22
AWS Storage Gatewaysupportsthree differentconfigurations:
Gateway-cachedvolumes— You can store your primarydata inAmazonS3 andretainyour frequentlyaccesseddata
locally.Gateway-cachedvolumesprovide substantialcostsavingsonprimarystorage,minimize the needtoscale your
storage on-premises,andretainlow-latencyaccesstoyourfrequentlyaccesseddata.
Gateway-storedvolumes— Inthe eventthatyouneedlow-latencyaccess toyourentire dataset,youcan configure
your gatewaytostore your primarydata locally,andasynchronouslybackuppoint-in-timesnapshotsof thisdatato
AmazonS3. Gateway-storedvolumesprovidedurable andinexpensiveoff-site backupsthatyoucan recoverlocallyor
fromAmazonEC2 if,forexample,youneedreplacementcapacityfordisasterrecovery.
Gateway-virtual tape library (gateway-VTL) — Withgateway-VTL,youcanhave an almostlimitlesscollectionof virtual
tapes. You can store each virtual tape ina virtual tape library (VTL) backedbyAmazonS3 or a virtual tape shelf (VTS)
backedby AmazonGlacier.The virtual tape libraryexposesanindustrystandardiSCSIinterface thatprovidesyour
backupapplicationwithon-line accesstothe virtual tapes.Whenyounolongerrequire immediate orfrequentaccessto
data containedona virtual tape,youcan use your backupapplicationtomove itfromits VTL toyour VTS to further
reduce yourstorage costs.
Compute
AmazonElasticCompute Cloud (AmazonEC2) providesresizable compute capacityinthe cloud. Withinminutes,youcan
create AmazonEC2 instances,whichare virtual machinesoverwhichyouhave completecontrol.Inthe contextof DR,
the abilitytorapidlycreate virtual machinesthatyoucancontrol iscritical.To describe everyfeature of AmazonEC2is
outside the scope of thisdocument; instead;we focusonthe aspectsof AmazonEC2 that are mostrelevanttoDR.
AmazonMachine Images(AMIs) are preconfiguredwithoperatingsystems,andsome preconfiguredAMIsmightalso
include applicationstacks.Youcan alsoconfigure yourownAMIs.In the contextof DR, we stronglyrecommendthatyou
configure andidentify yourownAMIsso that theycan launchas part of yourrecoveryprocedure.SuchAMIsshouldbe
preconfiguredwithyouroperatingsystemof choice plusappropriate piecesof the applicationstack.
AvailabilityZones are distinctlocationsthatare engineeredtobe insulatedfromfailuresinotherAvailabilityZones.They
alsoprovide inexpensive,low-latencynetworkconnectivitytootherAvailabilityZonesinthe same region.Bylaunching
instancesinseparate AvailabilityZones,youcanprotectyourapplicationsfromthe failure of asingle location. Regions
consistof one or more AvailabilityZones.
The AmazonEC2 VMImport Connectorvirtual appliance enablesyoutoimportvirtual machine imagesfromyour
existingenvironmenttoAmazonEC2 instances.
Networking
Whenyouare dealingwithadisaster,it’sverylikelythatyouwill have tomodifynetworksettingsasyoursystemis
failingovertoanothersite. AWSoffersseveralservices andfeatures thatenableyou tomanage and modifynetwork
settings.
AmazonRoute 53 isa highlyavailable andscalable DomainNameSystem(DNS)webservice.Itgivesdevelopersand
businessesareliable, cost-effective waytoroute users toInternetapplications. AmazonRoute 53includesanumberof
global load-balancingcapabilities (whichcanbe effective when youare dealingwithDRscenariossuchasDNS endpoint
healthchecks) andthe abilitytofailoverbetweenmultiple endpointsand evenstaticwebsiteshostedin AmazonS3.
ElasticIP addresses are staticIPaddressesdesignedfordynamiccloudcomputing. However,unlike traditional staticIP
addresses,ElasticIPaddressesenableyoutomaskinstance or AvailabilityZone failuresby programmaticallyremapping
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 7 of 22
your publicIPaddressestoinstancesinyouraccountina particularregion.ForDR, youcan alsopre-allocate some IP
addressesforthe mostcritical systemssothat theirIPaddressesare alreadyknownbefore disasterstrikes. Thiscan
simplifythe executionof the DRplan.
ElasticLoad Balancingautomaticallydistributesincomingapplicationtrafficacrossmultiple AmazonEC2instances.It
enablesyoutoachieve evengreaterfault tolerance inyourapplicationsby seamlesslyprovidingthe load-balancing
capacity that is neededinresponsetoincomingapplicationtraffic.Justasyoucan pre-allocate ElasticIPaddresses,you
can pre-allocate yourloadbalancersothatits DNSname is alreadyknown,whichcansimplifythe executionof yourDR
plan.
AmazonVirtual Private Cloud (AmazonVPC) letsyouprovisionaprivate,isolatedsectionof the AWScloudwhere you
can launchAWS resourcesina virtual networkthatyoudefine.Youhave completecontrol overyourvirtual networking
environment,includingselectionof yourownIPaddressrange,creationof subnets,andconfigurationof route tables
and networkgateways.Thisenablesyoutocreate a VPN connectionbetweenyourcorporate data centerand yourVPC,
and leverage the AWScloudasan extensionof yourcorporate data center.Inthe contextof DR, you can use Amazon
VPCto extendyourexistingnetworktopologytothe cloud;thiscan be especiallyappropriate whenrecovering
enterprise applicationsthatare typicallyonthe internal network.
AmazonDirectConnectmakesiteasyto setup a dedicatednetworkconnectionfromyourpremisestoAWS.Inmany
cases,thiscan reduce yournetworkcosts,increase bandwidththroughput,andprovide amore consistentnetwork
experience thanInternet-basedconnections.
Databases
For yourdatabase needs,considerusingtheseAWSservices:
AmazonRelational Database Service (AmazonRDS) makesiteasytosetup, operate,andscale a relational database in
the cloud.You can use AmazonRDS eitherinthe preparationphase forDR to holdyourcritical data ina database thatis
already running,orinthe recoveryphase torun yourproductiondatabase.When youwantto lookat multiple regions,
AmazonRDS givesyouthe abilitytosnapshotdatafromone regionto another,andalsoto have a readreplicarunningin
anotherregion.
AmazonDynamoDBisa fast,fullymanagedNoSQLdatabase service thatmakesitsimpleandcost-effective tostore and
retrieve anyamountof data andserve any level of requesttraffic.Ithasreliable throughputandsingle-digit, millisecond
latency. Youcan also use itin the preparationphase tocopydata to DynamoDBin anotherregionorto AmazonS3.
Duringthe recoveryphase of DR, youcan scale upseamlesslyinamatterof minuteswithasingle clickorAPIcall.
AmazonRedshift isafast,fullymanaged,petabyte-scaledatawarehouse service thatmakesitsimple andcost-effective
to efficientlyanalyze all yourdatausingyourexistingbusinessintelligence tools. Youcan use AmazonRedshiftinthe
preparationphase tosnapshotyourdata warehouse tobe durablystoredinAmazonS3 withinthe same regionor
copiedtoanotherregion.Duringthe recoveryphase of DR,you can quicklyrestore yourdatawarehouse intothe same
regionor withinanotherAWSregion.
You can alsoinstall andrun yourchoice of database software onAmazonEC2, andyou can choose froma varietyof
leadingdatabase systems.
For more informationaboutdatabase optionsonAWS,see RunningDatabasesonAWS.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 8 of 22
Deploymentorchestration
Deploymentautomationandpost-startupsoftware installation/configurationprocessesandtoolscanbe usedin
AmazonEC2. We highlyrecommendinvestmentsinthisarea. Thiscan be veryhelpful inthe recoveryphase,enabling
youto create the requiredsetof resourcesinanautomated way.
AWS CloudFormation givesdevelopersandsystemsadministratorsaneasywayto create a collectionof relatedAWS
resourcesandprovisiontheminanorderlyandpredictable fashion.Youcancreate templatesforyourenvironments
and deployassociatedcollectionsof resources(calledastack) as needed.
AWS ElasticBeanstalk isaneasy-to-use service fordeployingandscalingwebapplicationsandservicesdevelopedwith
Java,.NET, PHP,Node.js,Python,Ruby,andDocker. Youcan deployyourapplicationcode, andAWSElasticBeanstalk
will provisionthe operatingenvironmentforyourapplications.
AWS OpsWorks isan applicationmanagementservice thatmakesiteasytodeployandoperate applicationsof all types
and sizes. Youcan define yourenvironmentasa seriesof layers,andconfigure eachlayerasa tierof your application.
AWS OpsWorkshasautomatichost replacement,sointhe eventof aninstance failure it will be automaticallyreplaced.
You can use AWS OpsWorksinthe preparationphase totemplate yourenvironment,andyoucan combine it withAWS
CloudFormationinthe recoveryphase.Youcanquicklyprovisionanew stack fromthe storedconfiguration that
supports the definedRTO.
Security and compliance
There are manysecurity-relatedfeaturesacrossthe AWS services. We recommend thatyoureview the Security Best
Practiceswhitepaper.AWSalsoprovidesfurtherriskandcompliance informationinthe AWSSecurityCenter.A full
discussionof securityisoutof scope forthis paper.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 9 of 22
Example Disaster Recovery Scenarios with AWS
ThissectionoutlinesfourDRscenariosthathighlightthe use of AWSandcompare AWS withtraditional DR methods.
The followingfigure shows aspectrumforthe fourscenarios, arrangedbyhow quicklya systemcanbe available tousers
aftera DR event.
Figure1: Spectrum ofDisaster Recovery Options
AWS enables youtocost-effectivelyoperateeachof these DRstrategies.It’simportanttonote thatthese are just
examplesof possibleapproaches,andvariationsandcombinationsof these are possible. If yourapplicationisalready
runningonAWS, thenmultipleregionscanbe employedandthe same DRstrategies will still apply.
Backup and Restore
In mosttraditional environments,dataisbackedupto tape and sentoff-site regularly. If youuse thismethod,itcantake
a longtime to restore yoursysteminthe eventof a disruptionordisaster. AmazonS3isan ideal destinationforbackup
data that mightbe neededquicklytoperformarestore.Transferringdatato andfrom AmazonS3 is typicallydone
throughthe network, andistherefore accessible fromanylocation.There are manycommercial andopen-source
backupsolutionsthat integrate with AmazonS3. Youcan use AWSImport/Exporttotransferverylarge data setsby
shippingstorage devicesdirectlytoAWS.Forlonger-termdatastorage where retrieval timesof several hoursare
adequate,there isAmazonGlacier,whichhasthe same durabilitymodel asAmazonS3.AmazonGlacierisa low-cost
alternative startingfrom$0.01/GB per month.AmazonGlacierandAmazonS3 can be usedinconjunctiontoproduce a
tieredbackupsolution.
AWS Storage Gatewayenables snapshotsof youron-premisesdatavolumestobe transparentlycopiedinto AmazonS3
for backup.You can subsequently create local volumesorAmazonEBS volumesfromthese snapshots.
Storage-cachedvolumesallow youtostore yourprimarydata in AmazonS3, but keepyourfrequentlyaccesseddata
local for low-latencyaccess.AswithAWSStorage Gateway, youcan snapshot the data volumes togive highlydurable
backup.In the eventof DR, youcan restore the cache volumeseithertoa secondsite runningastorage cache gateway
or to AmazonEC2.
You can use the gateway-VTLconfigurationof AWSStorage Gatewayasa backup targetfor yourexistingbackup
managementsoftware. Thiscanbe usedas a replacementfortraditional magnetictape backup.
For systemsrunningonAWS, youalsocan back up intoAmazonS3. Snapshotsof AmazonEBS volumes,AmazonRDS
databases,andAmazonRedshift datawarehousescanbe storedinAmazonS3. Alternatively,youcancopy filesdirectly
intoAmazonS3, or you can choose to create backupfilesandcopythose to AmazonS3. There are many backup
solutionsthatstore datadirectlyinAmazonS3, andthese can be usedfromAmazonEC2 systemsaswell.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 10 of 22
The followingfigure showsdatabackupoptionstoAmazonS3,from eitheron-siteinfrastructure orfromAWS.
Figure 2: Data Backup Options to Amazon S3 from On-SiteInfrastructureor from AWS.
Of course,the backupof your data isonlyhalf of the story. If disasterstrikes,you’llneedtorecoveryourdata quickly
and reliably. Youshouldensure that yoursystemsare configuredto retainandsecure yourdata,and youshould test
your data recoveryprocesses.
The followingdiagramshowshowyoucanquicklyrestore asystemfromAmazonS3 backupsto AmazonEC2.
Figure 3: Restoring a System from Amazon S3 Backups to Amazon EC2
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 11 of 22
Keystepsforbackup andrestore:
1. Selectanappropriate tool or methodtoback up yourdata intoAWS.
2. Ensure that youhave an appropriate retentionpolicyforthisdata.
3. Ensure that appropriate securitymeasuresare inplace forthisdata, includingencryptionandaccesspolicies.
4. Regularlytestthe recoveryof thisdataand the restorationof yoursystem.
Pilot Light for Quick Recovery intoAWS
The term pilot light isoftenusedto describe aDR scenario inwhicha minimal versionof anenvironmentisalways
runninginthe cloud. The ideaof the pilotlightisananalogythat comesfromthe gas heater.Ina gas heater,a small
flame that’salwaysoncan quicklyignite the entirefurnace toheatupa house.
Thisscenarioissimilartoa backup-and-restore scenario.Forexample,withAWSyoucan maintainapilotlightby
configuringandrunningthe mostcritical core elementsof yoursysteminAWS.Whenthe time comesforrecovery,you
can rapidlyprovisionafull-scale productionenvironmentaroundthe critical core.
Infrastructure elementsforthe pilotlightitself typicallyincludeyourdatabase servers,whichwouldreplicatedatato
AmazonEC2 or AmazonRDS.Dependingonthe system, there mightbe othercritical dataoutside of the database that
needstobe replicatedtoAWS.Thisis the critical core of the system(the pilotlight) aroundwhichall other
infrastructure piecesinAWS (the restof the furnace) canquicklybe provisionedtorestore the completesystem.
To provisionthe remainderof the infrastructure torestore business-critical services,youwouldtypicallyhave somepre-
configuredserversbundledasAmazonMachine Images(AMIs),whichare readytobe started upat a moment’snotice.
Whenstartingrecovery,instancesfromthese AMIscome upquickly withtheirpre-definedrole(forexample, Webor
AppServer) withinthe deploymentaroundthe pilotlight. Fromanetworkingpointof view,you have twomainoptions
for provisioning:
 Use ElasticIPaddresses,whichcanbe pre-allocatedand identified inthe preparationphase forDR,and
associate themwithyourinstances.Note thatforMAC address-basedsoftware licensing,youcanuse elastic
networkinterfaces(ENIs),whichhave aMAC addressthat can alsobe pre-allocatedtoprovisionlicensesagainst.
You can associate these withyourinstances,justasyouwouldwithElasticIPaddresses.
 Use ElasticLoad Balancing(ELB) to distribute traffictomultiple instances.YouwouldthenupdateyourDNS
recordsto pointat your AmazonEC2 instance or pointto your loadbalancerusinga CNAME. We recommend
thisoptionfortraditional web-basedapplications.
For lesscritical systems,youcanensure thatyouhave any installationpackagesandconfigurationinformationavailable
inAWS, forexample,inthe formof an AmazonEBS snapshot.Thiswill speedupthe applicationserversetup,because
youcan quicklycreate multiple volumesinmultipleAvailabilityZonestoattachto AmazonEC2 instances.Youcan then
install andconfigure accordingly,forexample,byusingthe backup-and-restoremethod.
The pilotlightmethodgivesyouaquickerrecovery time thanthe backup-and-restore methodbecause the core pieces
of the systemare alreadyrunningandare continuallykeptuptodate.AWS enablesyoutoautomate the provisioning
and configurationof the infrastructure resources,whichcanbe a significantbenefittosave time andhelpprotect
againsthumanerrors. However,youwill stillneedtoperformsome installationandconfigurationtaskstorecoverthe
applicationsfully.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 12 of 22
Preparation phase
The followingfigure showsthe preparationphase,inwhichyouneedtohave yourregularlychangingdatareplicatedto
the pilotlight,the small core aroundwhichthe full environmentwillbe startedinthe recoveryphase.Yourless
frequentlyupdateddata,suchasoperatingsystemsandapplications,canbe periodicallyupdatedandstoredasAMIs.
Figure 4: ThePreparation PhaseofthePilot Light Scenario
Key stepsforpreparation:
1. Setup AmazonEC2 instancestoreplicate ormirrordata.
2. Ensure that youhave all supportingcustomsoftware packagesavailable inAWS.
3. Create and maintainAMIsof key serverswhere fastrecoveryisrequired.
4. Regularlyrunthese servers,testthem, andapplyanysoftware updatesandconfigurationchanges.
5. Considerautomatingthe provisioningof AWSresources.
Recoveryphase
To recoverthe remainderof the environmentaroundthe pilotlight,you canstart your systemsfromthe AMIs within
minutesonthe appropriate instance types.Foryourdynamicdata servers,youcanresize themtohandle production
volumesasneededoradd capacityaccordingly.Horizontal scalingoften isthe mostcost-effectiveandscalable approach
to add capacityto a system.Forexample,youcan add more webserversatpeaktimes.However, youcan alsochoose
largerAmazonEC2 instance types,andthusscale vertically forapplicationssuchasrelational databases.Froma
networkingperspective,anyrequiredDNSupdatescanbe done inparallel.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 13 of 22
Afterrecovery,youshouldensure thatredundancyisrestoredasquicklyaspossible.A failure of yourDRenvironment
shortlyafteryourproductionenvironmentfailsisunlikely, butyoushould be aware of thisrisk.Continue totake regular
backupsof yoursystem,andconsideradditionalredundancyatthe data layer.
The followingfigure showsthe recoveryphase of the pilotlightscenario.
Figure 5: TheRecovery PhaseofthePilot Light Scenario.
Key stepsforrecovery:
1. Start your applicationAmazonEC2instancesfromyourcustomAMIs.
2. Resize existingdatabase/datastore instancestoprocessthe increased traffic.
3. Addadditional database/datastore instancestogive the DRsite resilience inthe datatier;if youare using
AmazonRDS, turnon Multi-AZtoimprove resilience.
4. Change DNSto pointat the AmazonEC2 servers.
5. Install andconfigure anynon-AMIbasedsystems,ideallyinanautomated way.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 14 of 22
Warm Standby Solutionin AWS
The term warmstandby isusedtodescribe aDR scenario inwhicha scaled-downversionof afullyfunctional
environmentisalwaysrunninginthe cloud. A warmstandbysolutionextendsthe pilotlightelementsandpreparation.It
furtherdecreasesthe recoverytime because some servicesare alwaysrunning.Byidentifyingyourbusiness-critical
systems,you canfullyduplicate thesesystemsonAWSandhave themalwayson.
These serverscanbe runningon a minimum-sizedfleetof AmazonEC2instancesonthe smallestsizespossible.This
solutionisnotscaledtotake a full-productionload,butitisfullyfunctional.It canbe usedfornon-productionwork,
such as testing,qualityassurance,andinternal use.
In a disaster,the systemisscaledupquicklytohandle the productionload.InAWS,thiscanbe done by addingmore
instancestothe loadbalancerand by resizingthe small capacityserverstorunonlargerAmazonEC2 instance types.As
statedinthe precedingsection,horizontal scalingispreferredoververtical scaling.
Preparation phase
The followingfigure showsthe preparationphase forawarm standbysolution,inwhichanon-site solution andanAWS
solutionrunside-by-side.
Figure 6: ThePreparation Phaseofthe Warm Standby Scenario.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 15 of 22
Key stepsforpreparation:
1. Setup AmazonEC2 instancestoreplicate ormirrordata.
2. Create and maintainAMIs.
3. Run yourapplicationusingaminimal footprintof AmazonEC2instances or AWSinfrastructure.
4. Patch and update software andconfigurationfilesinlinewithyourlive environment.
Recoveryphase
In the case of failure of the productionsystem,the standbyenvironmentwillbe scaledupforproductionload,andDNS
records will be changedtoroute all traffictoAWS.
Figure 7: TheRecovery Phaseofthe Warm Standby Scenario.
Key stepsforrecovery:
1. Increase the size of the AmazonEC2 fleetsinservice withthe loadbalancer(horizontal scaling).
2. Start applicationsonlargerAmazonEC2 instance typesasneeded(vertical scaling).
3. Eithermanuallychange the DNSrecords,or use AmazonRoute 53 automatedhealthcheckssothatall trafficis
routedto the AWS environment.
4. ConsiderusingAutoScalingtoright-sizethe fleetoraccommodate the increasedload.
5. Addresilience orscale upyour database.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 16 of 22
Multi-Site Solution Deployedon AWS and On-Site
A multi-site solutionrunsinAWSaswell ason your existingon-site infrastructure,inanactive-active configuration.The
data replicationmethodthatyouemploywill be determinedbythe recovery pointthatyouchoose. Formore
information aboutrecoverypointoptions,seethe RecoveryTime Objective andRecoveryPointObjectivesectioninthis
whitepaper.
In additiontorecoverypointoptions,thereare variousreplicationmethods, suchassynchronousandasynchronous
methods. Formore information,seethe Replicationof Datasectioninthiswhitepaper.
You can use a DNSservice thatsupportsweightedrouting,suchas AmazonRoute 53, to route productiontrafficto
differentsites thatdeliverthe same applicationorservice.A proportionof trafficwill gotoyour infrastructure inAWS,
and the remainderwill gotoyouron-site infrastructure.
In an on-site disastersituation,youcanadjustthe DNSweightingandsendall traffictothe AWS servers.The capacityof
the AWS service canbe rapidlyincreasedtohandle the full productionload. Youcanuse AmazonEC2 AutoScalingto
automate thisprocess.Youmightneedsome applicationlogictodetectthe failure of the primarydatabase servicesand
cut overto the parallel database servicesrunninginAWS.
The cost of thisscenarioisdeterminedbyhowmuchproductiontrafficishandledbyAWSduring normal operation.In
the recoveryphase,youpay only for whatyouuse forthe durationthat the DR environmentisrequiredatfull scale.You
can furtherreduce costby purchasing AmazonEC2 ReservedInstancesforyour“alwayson”AWSservers.
Preparation phase
The followingfigure showshow youcanuse the weightedroutingpolicy of the AmazonRoute 53 DNS to route a portion
of yourtrafficto the AWS site.The applicationonAWSmightaccessdata sourcesinthe on-site productionsystem.Data
isreplicatedormirroredto the AWSinfrastructure.
Figure 8: ThePreparation PhaseoftheMulti-SiteScenario.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 17 of 22
Key stepsforpreparation:
1. Setup your AWSenvironmenttoduplicate yourproductionenvironment.
2. Setup DNS weighting, orsimilartrafficroutingtechnology,todistributeincomingrequeststobothsites.
Configure automatedfailovertore-route trafficawayfromthe affectedsite.
Recoveryphase
The followingfigure shows the change in trafficroutinginthe eventof anon-site disaster. Trafficiscutoverto the AWS
infrastructure byupdatingDNS,andall trafficand supportingdataqueriesare supportedbythe AWSinfrastructure.
Figure 9: TheRecovery Phaseofthe Multi-SiteScenario Involving On-Siteand AWS Infrastructure.
Key stepsforrecovery:
1. Eithermanuallyorby usingDNSfailover, change the DNSweightingsothatall requestsare senttothe AWS site.
2. Have applicationlogicforfailovertouse the local AWS database servers forall queries.
3. ConsiderusingAutoScalingtoautomaticallyright-size the AWSfleet.
You can furtherincrease the availabilityof yourmulti-site solutionbydesigningMulti-AZarchitectures.Formore
informationabouthowtodesignapplicationsthatspanmultipleavailabilityzones,see the BuildingFault-Tolerant
ApplicationsonAWS whitepaper.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 18 of 22
AWS Productionto an AWS DR SolutionUsing Multiple AWS Regions
ApplicationsdeployedonAWShave multi-site capability bymeansof multipleAvailabilityZones.Availability Zonesare
distinctlocationsthatare engineeredtobe insulatedfrom eachother.They provide inexpensive,low-latencynetwork
connectivity withinthe same region.
Some applicationsmighthave anadditionalrequirementtodeploytheircomponentsusingmultiple regions;thiscanbe
a businessorregulatoryrequirement.
Anyof the precedingscenarios inthiswhitepapercanbe deployedusingseparate AWSregions.The advantagesforboth
productionandDR scenariosinclude the following:
 You don’tneedtonegotiate contractswithanotherproviderinanotherregion
 You can use the same underlyingAWS technologiesacrossregions
 You can use the same toolsor APIs
For more information,see the MigratingAWSResourcestoaNew Region whitepaper.
Replication of Data
Whenyoureplicate datato a remote location, youshouldconsiderthese factors:
 Distance betweenthe sites — Larger distancestypicallyare subjecttomore latencyorjitter.
 Available bandwidth— The breadthand variability of the interconnections.
 Data rate requiredby your application — The data rate shouldbe lowerthanthe available bandwidth.
 Replicationtechnology— The replication technologyshouldbe parallel (sothatitcan use the network
effectively).
There are twomainapproaches for replicatingdata:synchronousandasynchronous.
Synchronous replication
Data is atomicallyupdatedinmultiple locations.Thisputsadependency onnetworkperformance andavailability. In
AWS, Availability Zoneswithinaregionare well connected,butphysicallyseparated. Forexample,whendeployedin
Multi-AZmode,AmazonRDSusessynchronousreplicationtoduplicate dataina secondAvailabilityZone.Thisensures
that data isnot lostif the primaryAvailabilityZone becomesunavailable.
Asynchronousreplication
Data is notatomically updatedinmultiple locations.Itis transferred asnetworkperformance andavailabilityallows,and
the applicationcontinuestowrite datathat mightnotbe fullyreplicatedyet.
Many database systemssupportasynchronousdatareplication.The database replicacanbe locatedremotely,andthe
replicadoesnothave to be completelysynchronizedwiththe primarydatabase server.Thisisacceptable inmany
scenarios,forexample,asa backupsource or reporting/read-onlyuse cases. Inadditionto database systems, youcan
alsoextend ittonetworkfile systemsanddatavolumes.
We recommendthatyou understandthe replicationtechnologyusedin yoursoftware solution.A detailedanalysisof
replicationtechnologyisbeyond the scope of thispaper.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 19 of 22
AWS regionsare completelyindependentof eachother,butthere are no differencesinthe wayyouaccessthemand
use them.Thisenables youtocreate DR processesthatspan continental distances,withoutthe challengesorcoststhat
thiswouldnormallyincur. Youcan back updata and systemstotwo or more AWS regions, allowingservice restoration
eveninthe face of extremelylarge-scaledisasters. Youcanuse AWS regionstoserve yourusers aroundthe globe with
relativelylowcomplexityto youroperational processes.
Failing Back from a Disaster
Once you have restoredyourprimary site toa workingstate,youwill needto restore yournormal service,whichisoften
referredtoas a “fail back.” DependingonyourDR strategy,thistypically meansreversingthe flowof datareplicationso
that any data updatesreceivedwhile the primarysite wasdowncanbe replicatedback,withoutthe lossof data. The
followingsteps outline the differentfail-backapproaches:
Backup and restore
1. Freeze datachangesto the DR site.
2. Take a backup.
3. Restore the backupto the primarysite.
4. Re-pointuserstothe primarysite.
5. Unfreeze the changes.
Pilotlight, warm standby, and multi-site
1. Establishreverse mirroring/replicationfromthe DRsite backto the primarysite,once the primarysite has
caught up withthe changes.
2. Freeze datachangesto the DR site.
3. Re-pointuserstothe primarysite.
4. Unfreeze the changes.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 20 of 22
Improving Your DR Plan
Thissectiondescribesthe importantstepsyoushouldfollow toestablisha strongDR plan.
Testing
AfteryourDR solutionisinplace,itneedstobe tested. You can testfrequently,whichisone of the keyadvantagesof
deployingonAWS.“Game day”is whenyouexercise afailovertothe DR environment,ensuringthatsufficient
documentationisinplace tomake the processas simple aspossible shouldthe real eventtake place.Spinningupa
duplicate environmentfortestingyourgame-dayscenariosisquickandcost-effectiveonAWS,andyou typicallydon’t
needtotouch yourproductionenvironment.Youcanuse AWS CloudFormation todeploycomplete environmentson
AWS.This usesa template todescribe the AWSresourcesandanyassociateddependenciesorruntime parameters that
are requiredtocreate a full environment.
Differentiatingyourtestsiskeytoensuringthatyouare coveredagainsta multitude of differenttypesof disasters.The
followingare examplesof possible game-dayscenarios:
 Powerlosstoa site or a setof servers
 Loss of ISP connectivitytoasingle site
 Virusimpactingcore businessservices thataffects multi-sites
 User error thatcauses the lossof data, requiringapoint-in-time recovery
Monitoringand alerting
You needtohave regularchecksand sufficient monitoringinplace toalertyou whenyourDR environmenthasbeen
impactedbyserverfailure,connectivityissues,andapplicationissues. AmazonCloudWatch providesaccesstometrics
aboutAWS resources,aswell as custommetrics that can be application–centricorevenbusiness-centric.Youcansetup
alarmsbasedon definedthresholdsonanyof the metricsand,where required,youcansetup AmazonSNSto send
alertsincase of unexpectedbehavior.
You can use any monitoringsolutionsonAWS,andyoucan also continue touse anyexistingmonitoringandalerting
toolsthat yourcompanyusesto monitoryourinstance metrics,aswell asguestOS statsand applicationhealth.
Backups
Afteryouhave switched toyourDR environment,youshouldcontinuetomake regularbackups.Testingbackupand
restore regularlyisessential asafall-backsolution.
AWS givesyouthe flexibilityto performfrequent,inexpensive DRtestswithoutneedingthe DRinfrastructure tobe
“alwayson.”
Useraccess
You can secure accessto resourcesinyour DR environmentbyusing AWSIdentityandAccessManagement (IAM). With
IAM, youcan create role-basedand user-basedsecuritypoliciesthatsegregateuserresponsibilities andrestrictuser
access to specified resourcesand tasksinyourDR environment.
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 21 of 22
Systemaccess
You can alsocreate rolesforyourAmazonEC2 resources,sothat onlyuserswhoare assignedtospecifiedroles can
performdefinedactionsonyourDR environment,suchasaccessingan AmazonS3 bucketor re-pointinganElasticIP
address.
Automation
You can automate the deploymentof applicationsontoAWS-basedserversandyouron-premisesserversby using
configurationmanagementororchestrationsoftware.Thisallowsyoutohandle applicationandconfigurationchange
managementacrossbothenvironmentswithease.There are several popularorchestrationsoftware optionsavailable.
For a listof solutionproviders,seethe AWSPartnerDirectory.3
AWS CloudFormation worksinconjunctionwithseveral toolstoprovisioninfrastructure servicesinanautomated way.
Higherlevelsof abstractionare alsoavailablewith AWSOpsWorks orAWSElasticBeanstalk.The overall goal isto
automate yourinstancesasmuch as possible.Formore information,see the Architectingforthe Cloud:BestPractices
whitepaper.
You can use AutoScalingto ensure thatyour pool of instancesisappropriatelysizedtomeetthe demandbasedonthe
metricsthat youspecifyinAWSCloudWatch.Thismeansthatina DR situation,asyouruserbase starts to use the
environmentmore,the solutioncanscale updynamicallytomeetthisincreaseddemand.Afterthe eventisoverand
usage potentiallydecreases,the solutioncanscale backdownto a minimumlevel of servers.
Software Licensing and DR
Ensuringthat youare correctlylicensedforyourAWSenvironmentisasimportantaslicensing foranyother
environment.AWSprovidesavarietyof modelstomake licensingeasierforyouto manage.Forexample,“BringYour
OwnLicense”ispossible forseveralsoftwarecomponentsoroperatingsystems.Alternately,there isarange of software
for whichthe cost of the license isincludedinthe hourlycharge.Thisisknownas“License included.”
“Bring yourOwnLicense”enablesyou toleverage yourexistingsoftwareinvestmentsduringadisaster.“License
included”minimizesup-frontlicensecostsfora DR site that doesn’tgetusedona day-to-daybasis.
If at anystage youare indoubtabout yourlicensesandhow theyapplytoAWS,contact your license reseller.
Conclusion
Many optionsandvariationsforDR exist.Thispaperhighlightssome of the common scenarios,rangingfromsimple
backupand restore to faulttolerant,multi-site solutions.AWSgivesyoufine-grainedcontrol andmanybuildingblocks
to buildthe appropriate DRsolution,givenyourDRobjectives(RTOandRPO) andbudget.The AWSservices are
available on-demand, andyoupay only for whatyouuse.Thisis a keyadvantage forDR, where significantinfrastructure
isneededquickly,butonlyinthe eventof a disaster.
Thiswhitepaperhasshownhow AWSprovidesflexible,cost-effective infrastructure solutions, enablingyoutohave a
more effectiveDRplan.
3 Solution providers can be found at http://aws.amazon.com/solutions/solution-providers/
Amazon Web Services – Using AWS for Disaster Recovery October2014
Page 22 of 22
Further Reading
 Amazon S3Getting Started Guide: http://docs.amazonwebservices.com/AmazonS3/latest/gsg/
 Amazon EC2Getting Started Guide:
http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/
 AWS PartnerDirectory(fora listof AWS solution providers):
http://aws.amazon.com/solutions/solution-providers/
 AWS SecurityandCompliance Center: http://aws.amazon.com/security/
 AWS Architecture Center: http://aws.amazon.com/architecture
 Whitepaper:DesigningFault-TolerantApplicationsinthe AWSCloud
 OtherAWS technical whitepapers:http://aws.amazon.com/whitepapers
Document Revisions
We’ve made the followingchanges tothiswhitepapersince itsoriginal publicationinJanuary,2012:
 Updatedinformationabout AWSregions
 Added informationabout newservices:AmazonGlacier,Amazon Redshift,AWSOpsWorks,AWSElastic
Beanstalk,andAmazonDynamoDB
 Added informationabout elasticnetworkinterfaces(ENIs)
 Addedinformationaboutvariousfeaturesof AWSservicesforDRscenariosusingmultiple AWSregions
 Addedinformationabout AWSStorage Gateway virtual tape libraries

Whte Paper: Using aws for disaster recovery

  • 1.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 1 of 22 Using Amazon Web Services for Disaster Recovery October 2014 Glen Robinson, Attila Narin, and Chris Elleman
  • 2.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 2 of 22 Contents Introduction...............................................................................................................................................................3 Recovery Time Objective and Recovery Point Objective ................................................................................................4 Traditional DR Investment Practices ............................................................................................................................4 AWS Services and Features Essential for Disaster Recovery...........................................................................................5 Example Disaster Recovery Scenarioswith AWS...........................................................................................................9 Backup and Restore ................................................................................................................................................9 Pilot Light for Quick Recovery into AWS .................................................................................................................11 Warm Standby Solution in AWS.............................................................................................................................14 Multi-Site Solution Deployed on AWS and On-Site..................................................................................................16 AWS Production to an AWS DR Solution Using Multiple AWS Regions......................................................................18 Replication of Data...................................................................................................................................................18 Failing Back from a Disaster.......................................................................................................................................19 Improving Your DR Plan ............................................................................................................................................20 Software Licensing and DR........................................................................................................................................21 Conclusion ...............................................................................................................................................................21 Further Reading........................................................................................................................................................22 Document Revisions.................................................................................................................................................22
  • 3.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 3 of 22 Abstract In the eventof a disaster,youcan quicklylaunchresourcesinAmazonWebServices(AWS) toensure business continuity.Thiswhitepaperhighlights AWSservicesandfeatures thatyoucan leverage foryour disasterrecovery (DR) processes tosignificantlyminimize the impactonyourdata,your system, andyouroverall businessoperations.The whitepaperalso includes scenarios thatshow you,step-by-step,how toimprove yourDRplan andleverage the full potential of the AWScloudfordisasterrecovery. Introduction Disasterrecovery (DR) isaboutpreparingforand recoveringfromadisaster.Anyeventthathasa negative impacton a company’s businesscontinuityorfinances couldbe termedadisaster.Thisincludes hardware orsoftware failure, a networkoutage, apoweroutage,physical damage toa building like fire orflooding,humanerror,orsome other significantevent. To minimize the impactof adisaster, companiesinvesttime andresourcestoplan andprepare, totrainemployees,and to documentandupdate processes. The amountof investmentforDRplanning fora particularsystemcanvary dramaticallydependingonthe cost of a potential outage. Companiesthathave traditionalphysical environments typically mustduplicate theirinfrastructure toensure the availabilityof spare capacityin the eventof adisaster.The infrastructure needstobe procured,installed,andmaintainedsothatitis ready to supportthe anticipatedcapacity requirements. Duringnormal operations,the infrastructure typically isunder-utilizedorover-provisioned. WithAmazonWeb Services(AWS),yourcompanycanscale up itsinfrastructure onan as-needed,pay-as-you-gobasis. You getaccess to the same highlysecure, reliable, andfastinfrastructure thatAmazonusestorunits ownglobal networkof websites.AWSalsogivesyou the flexibility toquickly change andoptimize resourcesduringaDR event, whichcan resultinsignificantcostsavings. Thiswhitepaperoutlinesbestpracticestoimprove yourDRprocesses,fromminimal investmentstofull-scaleavailability and faulttolerance,andshowsyouhowyoucan use AWS servicestoreduce costand ensure businesscontinuityduring a DR event.
  • 4.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 4 of 22 Recovery Time Objective and Recovery Point Objective Thiswhitepaperusestwocommonindustrytermsfordisasterplanning: Recoverytime objective (RTO)1 — The time it takesaftera disruptiontorestore abusinessprocesstoitsservice level,as definedbythe operational levelagreement(OLA).Forexample,if adisasteroccursat 12:00 PM(noon) andthe RTOis eighthours,the DR processshouldrestore the businessprocesstothe acceptable servicelevel by8:00 PM. Recoverypoint objective (RPO)2 — The acceptable amountof data lossmeasuredintime.Forexample, if adisaster occurs at 12:00 PM (noon) and the RPOis one hour, the systemshould recoverall datathat wasin the systembefore 11:00 AM. Data losswill spanonlyone hour,between11:00 AMand 12:00 PM(noon). A company typicallydecidesonanacceptable RTOand RPObasedon the financial impacttothe businesswhensystems are unavailable.The companydetermines financial impactbyconsideringmanyfactors,suchasthe lossof businessand damage to itsreputationdue todowntime andthe lackof systemsavailability. IT organizationsthenplansolutionstoprovide cost-effectivesystemrecoverybasedonthe RPOwithinthe timeline and the service levelestablishedbythe RTO. Traditional DR Investment Practices A traditional approachtoDR involvesdifferentlevelsof off-site duplicationof dataand infrastructure.Critical business servicesare setupand maintainedonthisinfrastructure andtestedatregularintervals. The disasterrecovery environment’slocationandthe source infrastructure shouldbe asignificantphysical distance aparttoensure thatthe disasterrecoveryenvironmentisisolatedfromfaultsthatcouldimpactthe source site. At a minimum,the infrastructurethatisrequired tosupportthe duplicate environment should include the following:  Facilitiestohouse the infrastructure,includingpowerandcooling.  Securitytoensure the physical protectionof assets.  Suitable capacitytoscale the environment.  Supportfor repairing,replacing,andrefreshing the infrastructure.  Contractual agreementswithanInternet service provider(ISP)toprovide Internetconnectivitythatcansustain bandwidthutilizationforthe environmentunder afull load.  Networkinfrastructure suchasfirewalls,routers,switches,andloadbalancers.  Enoughservercapacityto run all mission-critical services,includingstorage appliancesforthe supportingdata, and serverstorun applicationsandbackendservicessuchasuserauthentication, DomainNameSystem(DNS), DynamicHost ConfigurationProtocol (DHCP),monitoring,andalerting. 1 From http://en.wikipedia.org/wiki/Recovery_time_objective 2 From http://en.wikipedia.org/wiki/Recovery_point_objective
  • 5.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 5 of 22 AWS Services and Features Essential for Disaster Recovery Before we discussthe variousapproachestoDR,it isimportantto review the AWSservicesandfeaturesthatare the mostrelevanttodisasterrecovery.Thissection providesasummary. In the preparationphase of DR,it is importantto considerthe use of servicesandfeaturesthatsupportdatamigration and durable storage, because theyenable youtorestore backed-up,critical datatoAWS whendisasterstrikes.For some of the scenariosthatinvolve eitherascaled-downora fully scaleddeploymentof yoursysteminAWS,compute resourceswill be requiredaswell. Whenreactingto a disaster,itisessential toeitherquicklycommissioncompute resourcestorunyour systeminAWSor to orchestrate the failovertoalreadyrunningresourcesinAWS.The essentialinfrastructure piecesinclude DNS, networkingfeatures,andvarious AmazonElasticComputeCloud (AmazonEC2) featuresdescribed laterinthissection. Regions AmazonWebServicesare available inmultiple regionsaroundthe globe,soyoucanchoose the most appropriate locationforyour DR site,inadditiontothe site where yoursystemisfullydeployed.AWShasmultiplegeneralpurpose regions inthe Americas,EMEA,and AsiaPacificthatanyone withan AWSaccount can access. Special-use regionsare alsoavailable forgovernmentagenciesandforChina.See the full listof available regions here. Storage AmazonSimple Storage Service(AmazonS3) providesahighlydurable storage infrastructure designedformission- critical and primarydata storage.Objectsare redundantlystoredonmultipledevicesacrossmultiplefacilitieswithina region,designedto provide adurabilityof 99.999999999% (11 9s). AWSprovidesfurtherprotectionfordataretention and archivingthroughversioninginAmazonS3,AWS multi-factorauthentication(AWSMFA),bucketpolicies,andAWS IdentityandAccessManagement(IAM). AmazonGlacierprovidesextremelylow-coststorage fordataarchivingand backup.Objects(or archives,astheyare knowninAmazonGlacier) are optimizedforinfrequentaccess,forwhichretrievaltimesof several hoursare adequate. AmazonGlacierisdesignedforthe same durabilityasAmazonS3. AmazonElasticBlockStore (AmazonEBS) providesthe abilitytocreate point-in-time snapshotsof datavolumes. Youcan use the snapshotsas the startingpointfor newAmazonEBS volumes,andyoucan protectyour data forlong-term durabilitybecause snapshotsare storedwithinAmazonS3.Afteravolume iscreated, youcan attach itto a running AmazonEC2 instance.AmazonEBSvolumesprovide off-instancestorage thatpersistsindependentlyfromthe life of an instance and isreplicatedacrossmultipleserversinanAvailabilityZone topreventthe lossof data from the failure of any single component. AWS Import/Export acceleratesmovinglarge amountsof dataintoand out of AWS by usingportable storage devicesfor transport.AWS Import/Exportbypassesthe Internetand transfersyourdatadirectlyontoand off of storage devicesby meansof the high-speedinternal network of Amazon.Fordatasetsof significantsize,AWSImport/Exportisoftenfaster than Internettransferandmore costeffective thanupgradingyourconnectivity.Youcanuse AWS Import/Exportto migrate data intoandout of AmazonS3 bucketsandAmazonGlacier vaultsor intoAmazonEBS snapshots. AWS Storage Gateway isa service thatconnectsan on-premisessoftware appliance withcloud-basedstorage toprovide seamlessand highly secure integrationbetween youron-premisesITenvironmentand the storage infrastructure of AWS.
  • 6.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 6 of 22 AWS Storage Gatewaysupportsthree differentconfigurations: Gateway-cachedvolumes— You can store your primarydata inAmazonS3 andretainyour frequentlyaccesseddata locally.Gateway-cachedvolumesprovide substantialcostsavingsonprimarystorage,minimize the needtoscale your storage on-premises,andretainlow-latencyaccesstoyourfrequentlyaccesseddata. Gateway-storedvolumes— Inthe eventthatyouneedlow-latencyaccess toyourentire dataset,youcan configure your gatewaytostore your primarydata locally,andasynchronouslybackuppoint-in-timesnapshotsof thisdatato AmazonS3. Gateway-storedvolumesprovidedurable andinexpensiveoff-site backupsthatyoucan recoverlocallyor fromAmazonEC2 if,forexample,youneedreplacementcapacityfordisasterrecovery. Gateway-virtual tape library (gateway-VTL) — Withgateway-VTL,youcanhave an almostlimitlesscollectionof virtual tapes. You can store each virtual tape ina virtual tape library (VTL) backedbyAmazonS3 or a virtual tape shelf (VTS) backedby AmazonGlacier.The virtual tape libraryexposesanindustrystandardiSCSIinterface thatprovidesyour backupapplicationwithon-line accesstothe virtual tapes.Whenyounolongerrequire immediate orfrequentaccessto data containedona virtual tape,youcan use your backupapplicationtomove itfromits VTL toyour VTS to further reduce yourstorage costs. Compute AmazonElasticCompute Cloud (AmazonEC2) providesresizable compute capacityinthe cloud. Withinminutes,youcan create AmazonEC2 instances,whichare virtual machinesoverwhichyouhave completecontrol.Inthe contextof DR, the abilitytorapidlycreate virtual machinesthatyoucancontrol iscritical.To describe everyfeature of AmazonEC2is outside the scope of thisdocument; instead;we focusonthe aspectsof AmazonEC2 that are mostrelevanttoDR. AmazonMachine Images(AMIs) are preconfiguredwithoperatingsystems,andsome preconfiguredAMIsmightalso include applicationstacks.Youcan alsoconfigure yourownAMIs.In the contextof DR, we stronglyrecommendthatyou configure andidentify yourownAMIsso that theycan launchas part of yourrecoveryprocedure.SuchAMIsshouldbe preconfiguredwithyouroperatingsystemof choice plusappropriate piecesof the applicationstack. AvailabilityZones are distinctlocationsthatare engineeredtobe insulatedfromfailuresinotherAvailabilityZones.They alsoprovide inexpensive,low-latencynetworkconnectivitytootherAvailabilityZonesinthe same region.Bylaunching instancesinseparate AvailabilityZones,youcanprotectyourapplicationsfromthe failure of asingle location. Regions consistof one or more AvailabilityZones. The AmazonEC2 VMImport Connectorvirtual appliance enablesyoutoimportvirtual machine imagesfromyour existingenvironmenttoAmazonEC2 instances. Networking Whenyouare dealingwithadisaster,it’sverylikelythatyouwill have tomodifynetworksettingsasyoursystemis failingovertoanothersite. AWSoffersseveralservices andfeatures thatenableyou tomanage and modifynetwork settings. AmazonRoute 53 isa highlyavailable andscalable DomainNameSystem(DNS)webservice.Itgivesdevelopersand businessesareliable, cost-effective waytoroute users toInternetapplications. AmazonRoute 53includesanumberof global load-balancingcapabilities (whichcanbe effective when youare dealingwithDRscenariossuchasDNS endpoint healthchecks) andthe abilitytofailoverbetweenmultiple endpointsand evenstaticwebsiteshostedin AmazonS3. ElasticIP addresses are staticIPaddressesdesignedfordynamiccloudcomputing. However,unlike traditional staticIP addresses,ElasticIPaddressesenableyoutomaskinstance or AvailabilityZone failuresby programmaticallyremapping
  • 7.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 7 of 22 your publicIPaddressestoinstancesinyouraccountina particularregion.ForDR, youcan alsopre-allocate some IP addressesforthe mostcritical systemssothat theirIPaddressesare alreadyknownbefore disasterstrikes. Thiscan simplifythe executionof the DRplan. ElasticLoad Balancingautomaticallydistributesincomingapplicationtrafficacrossmultiple AmazonEC2instances.It enablesyoutoachieve evengreaterfault tolerance inyourapplicationsby seamlesslyprovidingthe load-balancing capacity that is neededinresponsetoincomingapplicationtraffic.Justasyoucan pre-allocate ElasticIPaddresses,you can pre-allocate yourloadbalancersothatits DNSname is alreadyknown,whichcansimplifythe executionof yourDR plan. AmazonVirtual Private Cloud (AmazonVPC) letsyouprovisionaprivate,isolatedsectionof the AWScloudwhere you can launchAWS resourcesina virtual networkthatyoudefine.Youhave completecontrol overyourvirtual networking environment,includingselectionof yourownIPaddressrange,creationof subnets,andconfigurationof route tables and networkgateways.Thisenablesyoutocreate a VPN connectionbetweenyourcorporate data centerand yourVPC, and leverage the AWScloudasan extensionof yourcorporate data center.Inthe contextof DR, you can use Amazon VPCto extendyourexistingnetworktopologytothe cloud;thiscan be especiallyappropriate whenrecovering enterprise applicationsthatare typicallyonthe internal network. AmazonDirectConnectmakesiteasyto setup a dedicatednetworkconnectionfromyourpremisestoAWS.Inmany cases,thiscan reduce yournetworkcosts,increase bandwidththroughput,andprovide amore consistentnetwork experience thanInternet-basedconnections. Databases For yourdatabase needs,considerusingtheseAWSservices: AmazonRelational Database Service (AmazonRDS) makesiteasytosetup, operate,andscale a relational database in the cloud.You can use AmazonRDS eitherinthe preparationphase forDR to holdyourcritical data ina database thatis already running,orinthe recoveryphase torun yourproductiondatabase.When youwantto lookat multiple regions, AmazonRDS givesyouthe abilitytosnapshotdatafromone regionto another,andalsoto have a readreplicarunningin anotherregion. AmazonDynamoDBisa fast,fullymanagedNoSQLdatabase service thatmakesitsimpleandcost-effective tostore and retrieve anyamountof data andserve any level of requesttraffic.Ithasreliable throughputandsingle-digit, millisecond latency. Youcan also use itin the preparationphase tocopydata to DynamoDBin anotherregionorto AmazonS3. Duringthe recoveryphase of DR, youcan scale upseamlesslyinamatterof minuteswithasingle clickorAPIcall. AmazonRedshift isafast,fullymanaged,petabyte-scaledatawarehouse service thatmakesitsimple andcost-effective to efficientlyanalyze all yourdatausingyourexistingbusinessintelligence tools. Youcan use AmazonRedshiftinthe preparationphase tosnapshotyourdata warehouse tobe durablystoredinAmazonS3 withinthe same regionor copiedtoanotherregion.Duringthe recoveryphase of DR,you can quicklyrestore yourdatawarehouse intothe same regionor withinanotherAWSregion. You can alsoinstall andrun yourchoice of database software onAmazonEC2, andyou can choose froma varietyof leadingdatabase systems. For more informationaboutdatabase optionsonAWS,see RunningDatabasesonAWS.
  • 8.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 8 of 22 Deploymentorchestration Deploymentautomationandpost-startupsoftware installation/configurationprocessesandtoolscanbe usedin AmazonEC2. We highlyrecommendinvestmentsinthisarea. Thiscan be veryhelpful inthe recoveryphase,enabling youto create the requiredsetof resourcesinanautomated way. AWS CloudFormation givesdevelopersandsystemsadministratorsaneasywayto create a collectionof relatedAWS resourcesandprovisiontheminanorderlyandpredictable fashion.Youcancreate templatesforyourenvironments and deployassociatedcollectionsof resources(calledastack) as needed. AWS ElasticBeanstalk isaneasy-to-use service fordeployingandscalingwebapplicationsandservicesdevelopedwith Java,.NET, PHP,Node.js,Python,Ruby,andDocker. Youcan deployyourapplicationcode, andAWSElasticBeanstalk will provisionthe operatingenvironmentforyourapplications. AWS OpsWorks isan applicationmanagementservice thatmakesiteasytodeployandoperate applicationsof all types and sizes. Youcan define yourenvironmentasa seriesof layers,andconfigure eachlayerasa tierof your application. AWS OpsWorkshasautomatichost replacement,sointhe eventof aninstance failure it will be automaticallyreplaced. You can use AWS OpsWorksinthe preparationphase totemplate yourenvironment,andyoucan combine it withAWS CloudFormationinthe recoveryphase.Youcanquicklyprovisionanew stack fromthe storedconfiguration that supports the definedRTO. Security and compliance There are manysecurity-relatedfeaturesacrossthe AWS services. We recommend thatyoureview the Security Best Practiceswhitepaper.AWSalsoprovidesfurtherriskandcompliance informationinthe AWSSecurityCenter.A full discussionof securityisoutof scope forthis paper.
  • 9.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 9 of 22 Example Disaster Recovery Scenarios with AWS ThissectionoutlinesfourDRscenariosthathighlightthe use of AWSandcompare AWS withtraditional DR methods. The followingfigure shows aspectrumforthe fourscenarios, arrangedbyhow quicklya systemcanbe available tousers aftera DR event. Figure1: Spectrum ofDisaster Recovery Options AWS enables youtocost-effectivelyoperateeachof these DRstrategies.It’simportanttonote thatthese are just examplesof possibleapproaches,andvariationsandcombinationsof these are possible. If yourapplicationisalready runningonAWS, thenmultipleregionscanbe employedandthe same DRstrategies will still apply. Backup and Restore In mosttraditional environments,dataisbackedupto tape and sentoff-site regularly. If youuse thismethod,itcantake a longtime to restore yoursysteminthe eventof a disruptionordisaster. AmazonS3isan ideal destinationforbackup data that mightbe neededquicklytoperformarestore.Transferringdatato andfrom AmazonS3 is typicallydone throughthe network, andistherefore accessible fromanylocation.There are manycommercial andopen-source backupsolutionsthat integrate with AmazonS3. Youcan use AWSImport/Exporttotransferverylarge data setsby shippingstorage devicesdirectlytoAWS.Forlonger-termdatastorage where retrieval timesof several hoursare adequate,there isAmazonGlacier,whichhasthe same durabilitymodel asAmazonS3.AmazonGlacierisa low-cost alternative startingfrom$0.01/GB per month.AmazonGlacierandAmazonS3 can be usedinconjunctiontoproduce a tieredbackupsolution. AWS Storage Gatewayenables snapshotsof youron-premisesdatavolumestobe transparentlycopiedinto AmazonS3 for backup.You can subsequently create local volumesorAmazonEBS volumesfromthese snapshots. Storage-cachedvolumesallow youtostore yourprimarydata in AmazonS3, but keepyourfrequentlyaccesseddata local for low-latencyaccess.AswithAWSStorage Gateway, youcan snapshot the data volumes togive highlydurable backup.In the eventof DR, youcan restore the cache volumeseithertoa secondsite runningastorage cache gateway or to AmazonEC2. You can use the gateway-VTLconfigurationof AWSStorage Gatewayasa backup targetfor yourexistingbackup managementsoftware. Thiscanbe usedas a replacementfortraditional magnetictape backup. For systemsrunningonAWS, youalsocan back up intoAmazonS3. Snapshotsof AmazonEBS volumes,AmazonRDS databases,andAmazonRedshift datawarehousescanbe storedinAmazonS3. Alternatively,youcancopy filesdirectly intoAmazonS3, or you can choose to create backupfilesandcopythose to AmazonS3. There are many backup solutionsthatstore datadirectlyinAmazonS3, andthese can be usedfromAmazonEC2 systemsaswell.
  • 10.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 10 of 22 The followingfigure showsdatabackupoptionstoAmazonS3,from eitheron-siteinfrastructure orfromAWS. Figure 2: Data Backup Options to Amazon S3 from On-SiteInfrastructureor from AWS. Of course,the backupof your data isonlyhalf of the story. If disasterstrikes,you’llneedtorecoveryourdata quickly and reliably. Youshouldensure that yoursystemsare configuredto retainandsecure yourdata,and youshould test your data recoveryprocesses. The followingdiagramshowshowyoucanquicklyrestore asystemfromAmazonS3 backupsto AmazonEC2. Figure 3: Restoring a System from Amazon S3 Backups to Amazon EC2
  • 11.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 11 of 22 Keystepsforbackup andrestore: 1. Selectanappropriate tool or methodtoback up yourdata intoAWS. 2. Ensure that youhave an appropriate retentionpolicyforthisdata. 3. Ensure that appropriate securitymeasuresare inplace forthisdata, includingencryptionandaccesspolicies. 4. Regularlytestthe recoveryof thisdataand the restorationof yoursystem. Pilot Light for Quick Recovery intoAWS The term pilot light isoftenusedto describe aDR scenario inwhicha minimal versionof anenvironmentisalways runninginthe cloud. The ideaof the pilotlightisananalogythat comesfromthe gas heater.Ina gas heater,a small flame that’salwaysoncan quicklyignite the entirefurnace toheatupa house. Thisscenarioissimilartoa backup-and-restore scenario.Forexample,withAWSyoucan maintainapilotlightby configuringandrunningthe mostcritical core elementsof yoursysteminAWS.Whenthe time comesforrecovery,you can rapidlyprovisionafull-scale productionenvironmentaroundthe critical core. Infrastructure elementsforthe pilotlightitself typicallyincludeyourdatabase servers,whichwouldreplicatedatato AmazonEC2 or AmazonRDS.Dependingonthe system, there mightbe othercritical dataoutside of the database that needstobe replicatedtoAWS.Thisis the critical core of the system(the pilotlight) aroundwhichall other infrastructure piecesinAWS (the restof the furnace) canquicklybe provisionedtorestore the completesystem. To provisionthe remainderof the infrastructure torestore business-critical services,youwouldtypicallyhave somepre- configuredserversbundledasAmazonMachine Images(AMIs),whichare readytobe started upat a moment’snotice. Whenstartingrecovery,instancesfromthese AMIscome upquickly withtheirpre-definedrole(forexample, Webor AppServer) withinthe deploymentaroundthe pilotlight. Fromanetworkingpointof view,you have twomainoptions for provisioning:  Use ElasticIPaddresses,whichcanbe pre-allocatedand identified inthe preparationphase forDR,and associate themwithyourinstances.Note thatforMAC address-basedsoftware licensing,youcanuse elastic networkinterfaces(ENIs),whichhave aMAC addressthat can alsobe pre-allocatedtoprovisionlicensesagainst. You can associate these withyourinstances,justasyouwouldwithElasticIPaddresses.  Use ElasticLoad Balancing(ELB) to distribute traffictomultiple instances.YouwouldthenupdateyourDNS recordsto pointat your AmazonEC2 instance or pointto your loadbalancerusinga CNAME. We recommend thisoptionfortraditional web-basedapplications. For lesscritical systems,youcanensure thatyouhave any installationpackagesandconfigurationinformationavailable inAWS, forexample,inthe formof an AmazonEBS snapshot.Thiswill speedupthe applicationserversetup,because youcan quicklycreate multiple volumesinmultipleAvailabilityZonestoattachto AmazonEC2 instances.Youcan then install andconfigure accordingly,forexample,byusingthe backup-and-restoremethod. The pilotlightmethodgivesyouaquickerrecovery time thanthe backup-and-restore methodbecause the core pieces of the systemare alreadyrunningandare continuallykeptuptodate.AWS enablesyoutoautomate the provisioning and configurationof the infrastructure resources,whichcanbe a significantbenefittosave time andhelpprotect againsthumanerrors. However,youwill stillneedtoperformsome installationandconfigurationtaskstorecoverthe applicationsfully.
  • 12.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 12 of 22 Preparation phase The followingfigure showsthe preparationphase,inwhichyouneedtohave yourregularlychangingdatareplicatedto the pilotlight,the small core aroundwhichthe full environmentwillbe startedinthe recoveryphase.Yourless frequentlyupdateddata,suchasoperatingsystemsandapplications,canbe periodicallyupdatedandstoredasAMIs. Figure 4: ThePreparation PhaseofthePilot Light Scenario Key stepsforpreparation: 1. Setup AmazonEC2 instancestoreplicate ormirrordata. 2. Ensure that youhave all supportingcustomsoftware packagesavailable inAWS. 3. Create and maintainAMIsof key serverswhere fastrecoveryisrequired. 4. Regularlyrunthese servers,testthem, andapplyanysoftware updatesandconfigurationchanges. 5. Considerautomatingthe provisioningof AWSresources. Recoveryphase To recoverthe remainderof the environmentaroundthe pilotlight,you canstart your systemsfromthe AMIs within minutesonthe appropriate instance types.Foryourdynamicdata servers,youcanresize themtohandle production volumesasneededoradd capacityaccordingly.Horizontal scalingoften isthe mostcost-effectiveandscalable approach to add capacityto a system.Forexample,youcan add more webserversatpeaktimes.However, youcan alsochoose largerAmazonEC2 instance types,andthusscale vertically forapplicationssuchasrelational databases.Froma networkingperspective,anyrequiredDNSupdatescanbe done inparallel.
  • 13.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 13 of 22 Afterrecovery,youshouldensure thatredundancyisrestoredasquicklyaspossible.A failure of yourDRenvironment shortlyafteryourproductionenvironmentfailsisunlikely, butyoushould be aware of thisrisk.Continue totake regular backupsof yoursystem,andconsideradditionalredundancyatthe data layer. The followingfigure showsthe recoveryphase of the pilotlightscenario. Figure 5: TheRecovery PhaseofthePilot Light Scenario. Key stepsforrecovery: 1. Start your applicationAmazonEC2instancesfromyourcustomAMIs. 2. Resize existingdatabase/datastore instancestoprocessthe increased traffic. 3. Addadditional database/datastore instancestogive the DRsite resilience inthe datatier;if youare using AmazonRDS, turnon Multi-AZtoimprove resilience. 4. Change DNSto pointat the AmazonEC2 servers. 5. Install andconfigure anynon-AMIbasedsystems,ideallyinanautomated way.
  • 14.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 14 of 22 Warm Standby Solutionin AWS The term warmstandby isusedtodescribe aDR scenario inwhicha scaled-downversionof afullyfunctional environmentisalwaysrunninginthe cloud. A warmstandbysolutionextendsthe pilotlightelementsandpreparation.It furtherdecreasesthe recoverytime because some servicesare alwaysrunning.Byidentifyingyourbusiness-critical systems,you canfullyduplicate thesesystemsonAWSandhave themalwayson. These serverscanbe runningon a minimum-sizedfleetof AmazonEC2instancesonthe smallestsizespossible.This solutionisnotscaledtotake a full-productionload,butitisfullyfunctional.It canbe usedfornon-productionwork, such as testing,qualityassurance,andinternal use. In a disaster,the systemisscaledupquicklytohandle the productionload.InAWS,thiscanbe done by addingmore instancestothe loadbalancerand by resizingthe small capacityserverstorunonlargerAmazonEC2 instance types.As statedinthe precedingsection,horizontal scalingispreferredoververtical scaling. Preparation phase The followingfigure showsthe preparationphase forawarm standbysolution,inwhichanon-site solution andanAWS solutionrunside-by-side. Figure 6: ThePreparation Phaseofthe Warm Standby Scenario.
  • 15.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 15 of 22 Key stepsforpreparation: 1. Setup AmazonEC2 instancestoreplicate ormirrordata. 2. Create and maintainAMIs. 3. Run yourapplicationusingaminimal footprintof AmazonEC2instances or AWSinfrastructure. 4. Patch and update software andconfigurationfilesinlinewithyourlive environment. Recoveryphase In the case of failure of the productionsystem,the standbyenvironmentwillbe scaledupforproductionload,andDNS records will be changedtoroute all traffictoAWS. Figure 7: TheRecovery Phaseofthe Warm Standby Scenario. Key stepsforrecovery: 1. Increase the size of the AmazonEC2 fleetsinservice withthe loadbalancer(horizontal scaling). 2. Start applicationsonlargerAmazonEC2 instance typesasneeded(vertical scaling). 3. Eithermanuallychange the DNSrecords,or use AmazonRoute 53 automatedhealthcheckssothatall trafficis routedto the AWS environment. 4. ConsiderusingAutoScalingtoright-sizethe fleetoraccommodate the increasedload. 5. Addresilience orscale upyour database.
  • 16.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 16 of 22 Multi-Site Solution Deployedon AWS and On-Site A multi-site solutionrunsinAWSaswell ason your existingon-site infrastructure,inanactive-active configuration.The data replicationmethodthatyouemploywill be determinedbythe recovery pointthatyouchoose. Formore information aboutrecoverypointoptions,seethe RecoveryTime Objective andRecoveryPointObjectivesectioninthis whitepaper. In additiontorecoverypointoptions,thereare variousreplicationmethods, suchassynchronousandasynchronous methods. Formore information,seethe Replicationof Datasectioninthiswhitepaper. You can use a DNSservice thatsupportsweightedrouting,suchas AmazonRoute 53, to route productiontrafficto differentsites thatdeliverthe same applicationorservice.A proportionof trafficwill gotoyour infrastructure inAWS, and the remainderwill gotoyouron-site infrastructure. In an on-site disastersituation,youcanadjustthe DNSweightingandsendall traffictothe AWS servers.The capacityof the AWS service canbe rapidlyincreasedtohandle the full productionload. Youcanuse AmazonEC2 AutoScalingto automate thisprocess.Youmightneedsome applicationlogictodetectthe failure of the primarydatabase servicesand cut overto the parallel database servicesrunninginAWS. The cost of thisscenarioisdeterminedbyhowmuchproductiontrafficishandledbyAWSduring normal operation.In the recoveryphase,youpay only for whatyouuse forthe durationthat the DR environmentisrequiredatfull scale.You can furtherreduce costby purchasing AmazonEC2 ReservedInstancesforyour“alwayson”AWSservers. Preparation phase The followingfigure showshow youcanuse the weightedroutingpolicy of the AmazonRoute 53 DNS to route a portion of yourtrafficto the AWS site.The applicationonAWSmightaccessdata sourcesinthe on-site productionsystem.Data isreplicatedormirroredto the AWSinfrastructure. Figure 8: ThePreparation PhaseoftheMulti-SiteScenario.
  • 17.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 17 of 22 Key stepsforpreparation: 1. Setup your AWSenvironmenttoduplicate yourproductionenvironment. 2. Setup DNS weighting, orsimilartrafficroutingtechnology,todistributeincomingrequeststobothsites. Configure automatedfailovertore-route trafficawayfromthe affectedsite. Recoveryphase The followingfigure shows the change in trafficroutinginthe eventof anon-site disaster. Trafficiscutoverto the AWS infrastructure byupdatingDNS,andall trafficand supportingdataqueriesare supportedbythe AWSinfrastructure. Figure 9: TheRecovery Phaseofthe Multi-SiteScenario Involving On-Siteand AWS Infrastructure. Key stepsforrecovery: 1. Eithermanuallyorby usingDNSfailover, change the DNSweightingsothatall requestsare senttothe AWS site. 2. Have applicationlogicforfailovertouse the local AWS database servers forall queries. 3. ConsiderusingAutoScalingtoautomaticallyright-size the AWSfleet. You can furtherincrease the availabilityof yourmulti-site solutionbydesigningMulti-AZarchitectures.Formore informationabouthowtodesignapplicationsthatspanmultipleavailabilityzones,see the BuildingFault-Tolerant ApplicationsonAWS whitepaper.
  • 18.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 18 of 22 AWS Productionto an AWS DR SolutionUsing Multiple AWS Regions ApplicationsdeployedonAWShave multi-site capability bymeansof multipleAvailabilityZones.Availability Zonesare distinctlocationsthatare engineeredtobe insulatedfrom eachother.They provide inexpensive,low-latencynetwork connectivity withinthe same region. Some applicationsmighthave anadditionalrequirementtodeploytheircomponentsusingmultiple regions;thiscanbe a businessorregulatoryrequirement. Anyof the precedingscenarios inthiswhitepapercanbe deployedusingseparate AWSregions.The advantagesforboth productionandDR scenariosinclude the following:  You don’tneedtonegotiate contractswithanotherproviderinanotherregion  You can use the same underlyingAWS technologiesacrossregions  You can use the same toolsor APIs For more information,see the MigratingAWSResourcestoaNew Region whitepaper. Replication of Data Whenyoureplicate datato a remote location, youshouldconsiderthese factors:  Distance betweenthe sites — Larger distancestypicallyare subjecttomore latencyorjitter.  Available bandwidth— The breadthand variability of the interconnections.  Data rate requiredby your application — The data rate shouldbe lowerthanthe available bandwidth.  Replicationtechnology— The replication technologyshouldbe parallel (sothatitcan use the network effectively). There are twomainapproaches for replicatingdata:synchronousandasynchronous. Synchronous replication Data is atomicallyupdatedinmultiple locations.Thisputsadependency onnetworkperformance andavailability. In AWS, Availability Zoneswithinaregionare well connected,butphysicallyseparated. Forexample,whendeployedin Multi-AZmode,AmazonRDSusessynchronousreplicationtoduplicate dataina secondAvailabilityZone.Thisensures that data isnot lostif the primaryAvailabilityZone becomesunavailable. Asynchronousreplication Data is notatomically updatedinmultiple locations.Itis transferred asnetworkperformance andavailabilityallows,and the applicationcontinuestowrite datathat mightnotbe fullyreplicatedyet. Many database systemssupportasynchronousdatareplication.The database replicacanbe locatedremotely,andthe replicadoesnothave to be completelysynchronizedwiththe primarydatabase server.Thisisacceptable inmany scenarios,forexample,asa backupsource or reporting/read-onlyuse cases. Inadditionto database systems, youcan alsoextend ittonetworkfile systemsanddatavolumes. We recommendthatyou understandthe replicationtechnologyusedin yoursoftware solution.A detailedanalysisof replicationtechnologyisbeyond the scope of thispaper.
  • 19.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 19 of 22 AWS regionsare completelyindependentof eachother,butthere are no differencesinthe wayyouaccessthemand use them.Thisenables youtocreate DR processesthatspan continental distances,withoutthe challengesorcoststhat thiswouldnormallyincur. Youcan back updata and systemstotwo or more AWS regions, allowingservice restoration eveninthe face of extremelylarge-scaledisasters. Youcanuse AWS regionstoserve yourusers aroundthe globe with relativelylowcomplexityto youroperational processes. Failing Back from a Disaster Once you have restoredyourprimary site toa workingstate,youwill needto restore yournormal service,whichisoften referredtoas a “fail back.” DependingonyourDR strategy,thistypically meansreversingthe flowof datareplicationso that any data updatesreceivedwhile the primarysite wasdowncanbe replicatedback,withoutthe lossof data. The followingsteps outline the differentfail-backapproaches: Backup and restore 1. Freeze datachangesto the DR site. 2. Take a backup. 3. Restore the backupto the primarysite. 4. Re-pointuserstothe primarysite. 5. Unfreeze the changes. Pilotlight, warm standby, and multi-site 1. Establishreverse mirroring/replicationfromthe DRsite backto the primarysite,once the primarysite has caught up withthe changes. 2. Freeze datachangesto the DR site. 3. Re-pointuserstothe primarysite. 4. Unfreeze the changes.
  • 20.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 20 of 22 Improving Your DR Plan Thissectiondescribesthe importantstepsyoushouldfollow toestablisha strongDR plan. Testing AfteryourDR solutionisinplace,itneedstobe tested. You can testfrequently,whichisone of the keyadvantagesof deployingonAWS.“Game day”is whenyouexercise afailovertothe DR environment,ensuringthatsufficient documentationisinplace tomake the processas simple aspossible shouldthe real eventtake place.Spinningupa duplicate environmentfortestingyourgame-dayscenariosisquickandcost-effectiveonAWS,andyou typicallydon’t needtotouch yourproductionenvironment.Youcanuse AWS CloudFormation todeploycomplete environmentson AWS.This usesa template todescribe the AWSresourcesandanyassociateddependenciesorruntime parameters that are requiredtocreate a full environment. Differentiatingyourtestsiskeytoensuringthatyouare coveredagainsta multitude of differenttypesof disasters.The followingare examplesof possible game-dayscenarios:  Powerlosstoa site or a setof servers  Loss of ISP connectivitytoasingle site  Virusimpactingcore businessservices thataffects multi-sites  User error thatcauses the lossof data, requiringapoint-in-time recovery Monitoringand alerting You needtohave regularchecksand sufficient monitoringinplace toalertyou whenyourDR environmenthasbeen impactedbyserverfailure,connectivityissues,andapplicationissues. AmazonCloudWatch providesaccesstometrics aboutAWS resources,aswell as custommetrics that can be application–centricorevenbusiness-centric.Youcansetup alarmsbasedon definedthresholdsonanyof the metricsand,where required,youcansetup AmazonSNSto send alertsincase of unexpectedbehavior. You can use any monitoringsolutionsonAWS,andyoucan also continue touse anyexistingmonitoringandalerting toolsthat yourcompanyusesto monitoryourinstance metrics,aswell asguestOS statsand applicationhealth. Backups Afteryouhave switched toyourDR environment,youshouldcontinuetomake regularbackups.Testingbackupand restore regularlyisessential asafall-backsolution. AWS givesyouthe flexibilityto performfrequent,inexpensive DRtestswithoutneedingthe DRinfrastructure tobe “alwayson.” Useraccess You can secure accessto resourcesinyour DR environmentbyusing AWSIdentityandAccessManagement (IAM). With IAM, youcan create role-basedand user-basedsecuritypoliciesthatsegregateuserresponsibilities andrestrictuser access to specified resourcesand tasksinyourDR environment.
  • 21.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 21 of 22 Systemaccess You can alsocreate rolesforyourAmazonEC2 resources,sothat onlyuserswhoare assignedtospecifiedroles can performdefinedactionsonyourDR environment,suchasaccessingan AmazonS3 bucketor re-pointinganElasticIP address. Automation You can automate the deploymentof applicationsontoAWS-basedserversandyouron-premisesserversby using configurationmanagementororchestrationsoftware.Thisallowsyoutohandle applicationandconfigurationchange managementacrossbothenvironmentswithease.There are several popularorchestrationsoftware optionsavailable. For a listof solutionproviders,seethe AWSPartnerDirectory.3 AWS CloudFormation worksinconjunctionwithseveral toolstoprovisioninfrastructure servicesinanautomated way. Higherlevelsof abstractionare alsoavailablewith AWSOpsWorks orAWSElasticBeanstalk.The overall goal isto automate yourinstancesasmuch as possible.Formore information,see the Architectingforthe Cloud:BestPractices whitepaper. You can use AutoScalingto ensure thatyour pool of instancesisappropriatelysizedtomeetthe demandbasedonthe metricsthat youspecifyinAWSCloudWatch.Thismeansthatina DR situation,asyouruserbase starts to use the environmentmore,the solutioncanscale updynamicallytomeetthisincreaseddemand.Afterthe eventisoverand usage potentiallydecreases,the solutioncanscale backdownto a minimumlevel of servers. Software Licensing and DR Ensuringthat youare correctlylicensedforyourAWSenvironmentisasimportantaslicensing foranyother environment.AWSprovidesavarietyof modelstomake licensingeasierforyouto manage.Forexample,“BringYour OwnLicense”ispossible forseveralsoftwarecomponentsoroperatingsystems.Alternately,there isarange of software for whichthe cost of the license isincludedinthe hourlycharge.Thisisknownas“License included.” “Bring yourOwnLicense”enablesyou toleverage yourexistingsoftwareinvestmentsduringadisaster.“License included”minimizesup-frontlicensecostsfora DR site that doesn’tgetusedona day-to-daybasis. If at anystage youare indoubtabout yourlicensesandhow theyapplytoAWS,contact your license reseller. Conclusion Many optionsandvariationsforDR exist.Thispaperhighlightssome of the common scenarios,rangingfromsimple backupand restore to faulttolerant,multi-site solutions.AWSgivesyoufine-grainedcontrol andmanybuildingblocks to buildthe appropriate DRsolution,givenyourDRobjectives(RTOandRPO) andbudget.The AWSservices are available on-demand, andyoupay only for whatyouuse.Thisis a keyadvantage forDR, where significantinfrastructure isneededquickly,butonlyinthe eventof a disaster. Thiswhitepaperhasshownhow AWSprovidesflexible,cost-effective infrastructure solutions, enablingyoutohave a more effectiveDRplan. 3 Solution providers can be found at http://aws.amazon.com/solutions/solution-providers/
  • 22.
    Amazon Web Services– Using AWS for Disaster Recovery October2014 Page 22 of 22 Further Reading  Amazon S3Getting Started Guide: http://docs.amazonwebservices.com/AmazonS3/latest/gsg/  Amazon EC2Getting Started Guide: http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/  AWS PartnerDirectory(fora listof AWS solution providers): http://aws.amazon.com/solutions/solution-providers/  AWS SecurityandCompliance Center: http://aws.amazon.com/security/  AWS Architecture Center: http://aws.amazon.com/architecture  Whitepaper:DesigningFault-TolerantApplicationsinthe AWSCloud  OtherAWS technical whitepapers:http://aws.amazon.com/whitepapers Document Revisions We’ve made the followingchanges tothiswhitepapersince itsoriginal publicationinJanuary,2012:  Updatedinformationabout AWSregions  Added informationabout newservices:AmazonGlacier,Amazon Redshift,AWSOpsWorks,AWSElastic Beanstalk,andAmazonDynamoDB  Added informationabout elasticnetworkinterfaces(ENIs)  Addedinformationaboutvariousfeaturesof AWSservicesforDRscenariosusingmultiple AWSregions  Addedinformationabout AWSStorage Gateway virtual tape libraries