I N F O W O R L D . C O M D E E P D I V E S E R I E S 2Disaster Recovery Deep DiveiSmart strategies fordata protectionDisaster recovery and business continuity may be IT’s most importantresponsibilities. The key is to create smart policies collaboratively—and thenline up the appropriate technologies to matchi By Matt PriggeDATA IS LIKE OXYGEN. Without it, most modernorganizations would quickly find themselves gaspingfor breath; if they went too long without their customer,transaction, or production data, they’d die.The threats to data are everywhere, from natural disas-ters and malware outbreaks to simple human error, likeaccidental deletions. Yet surprisingly few IT organizationstake the time to construct the policies and infrastructureneeded to protect the data organizations need to survive.The result? Poorly planned, ad hoc recovery and continuitymethodologies that no one fully understands and which areoften obsolete before they’re even implemented.Thriving in today’s data-centric environment requiresa renewed focus on disaster recovery and business con-tinuity procedures and policies. It’s not just an IT prob-lem. Data loss affects every part of every organization;because of that, every stakeholder—from the executivesuite to facilities and maintenance—needs to be involvedin the decisions surrounding data ownership.This type of collaborative approach is neither easy norinexpensive. But it will allow both IT and business stake-holders to be on the same page about where the organi-zation’s disaster recovery (DR) and business continuity(BC) dollars are being spent, and why. More important, itwill ensure everyone knows what to expect in the eventof a disaster and can respond appropriately.WHAT ARE DR AND BC?Disaster recovery and business continuity have come tomean different things to different people. But the coreconcepts are the same: Deploying the correct policies,procedures and layers of technology to allow organiza-tions to continue to function in the event disaster strikes.This can range from something as simple as having easilyaccessible backups in case of hardware failure to nearlyinstant fail-over to a secondary data center when yourprimary one is taken out by a cataclysmic event.Generally speaking, DR is a series of procedures andpolicies that when properly implemented allow yourorganization to recover from a data disaster. How longit takes you to recover, and how stale the data will beonce you’re back online, will depend on the recoverytime objectives (RTOs) and recovery point objectives(RPOs) decided upon by your organization’s stakehold-ers. DR’s job is not to provide consistent up-time, but toensure you can recover your data in the event that it’slost, inaccessible, or compromised.A comprehensive disaster recovery plan would alsoinclude anything else required to get your environmentback on its feet, including how and where backup setsare stored, how they can be retrieved, what hardware isneeded to access them, where to obtain alternate hard-ware if the primary systems are inaccessible, whetherinstallation media for applications would be required,and so on.Business continuity systems aim to mitigate or neu-tralize the effects of a disaster by minimizing businessinterruption. They typically rely on high availability (HA)hardware—redundant systems operating in real time andwith real data, so that if one fails the others can pick upthe slack almost instantaneously with little or no effecton business processes. An HA event (such as a clusternode failing over) might introduce a minute’s worth of
I N F O W O R L D . C O M D E E P D I V E S E R I E S 3Disaster Recovery Deep Diveiservice interruption, whereas recovering a backup onto anew piece of hardware could take hours or even days—especially if replacement hardware isn’t readily available.But having HA systems in place does not ensuredisaster won’t strike. There will always be failure vec-tors like data corruption, software glitches, power spikes,and storms that can skip right on past every redundantsystem to despoil your data and ruin your day. That’sthe most important thing to remember about HA: Itonly lowers your probability of experiencing disaster, itdoesn’t make you immune.Business continuity ties together the availability goalsof HA with the recoverability goals of DR. A carefullyconceived BC design can allow you to recover from afull-blown disaster with limited downtime and near-zerodata loss. It is by far the most involved and expensiveapproach to take, but many enterprises have concludedthat data is so vital to their day-to-day operations BC istoo important not to pursue.BUILDING A POLICYThere are four main stages to building a coherent BC/DR policy for your organization. The first task is to definea scope that details which applications will be covered bythe policy. Second, you need to define the risks you’reattempting to protect your infrastructure against. Thethird task is to define the precise requirements of thepolicy in terms of RTO and RPO for the risk categoriesyou’ve defined. Only after you’ve completed those stepscan you determine the technological building blocksyou’ll need to deliver on the policies you’ve defined.Throughout this process, it’s vital to involve businessstakeholders in the decision-making process—howevernontechnical they may be. In fact, the less technicalthose stakeholders are, the more critical it is that theybe involved. A failure to set downtime and recoveryexpectations properly can have devastating, career-end-ing consequences.SCOPEWhen considering the scope of a BC/DR plan, it’s impor-tant to start by looking at the user-facing applications youare attempting to protect. If you start from the perspec-tive of the application and then determine everythingit needs to survive, you’ll find all the infrastructure anddata associated with it. If you start from the infrastruc-ture and work up to the application, it’s easy to blindyourself to things the application needs in order to run.For example, you could create a flawless backup of yourapplication and database clusters, but still forget to backup a file share that contains the application client, alongwith all of its hard-to-reproduce configuration data.Though it makes a great deal of sense to separate lessmission-critical applications from those that the businessdepends upon heavily, it is increasingly common to seeBC/DR policy scopes that extend to nearly all of thecore IT infrastructure. This is primarily as a result of thepopularity of server virtualization, network convergence,and centralization of primary storage resources. In suchenvironments, a BC/DR policy that seeks to protect amission-critical application served by a small collectionof virtual machines may well end up including nearlyevery major resource in the data center within its scope.This is not necessarily a bad thing, but it can drive up thecosts significantly.RISKSAfter a scope for the policy has been set, the next stepis to determine which risks the policy will attempt toaddress. IT architectures are vulnerable to a dizzyingarray of risks, ranging from disk failure all the way up toHurricane Katrina-level natural disasters.Typically, almost everyone will want to plan for com-mon risks such as hardware failure, but some organi-zations may choose not to plan for less-probable riskssuch as regional disasters or pandemic outbreaks. It’snot unheard of to see significant investments made inan effort to mitigate disasters that the business’s coreoperations—specifically human resources—could notwithstand in any case.Knowing which risks you are and are not protectingthe organization’s data against is critical to thoroughplanning and implementation. No matter what yourorganization chooses to do, the most important thing isthat everyone from the C-level executives on down ison board with the decisions made.REQUIREMENTSAfter you’ve set your scope and defined the risks, it’stime to put hard numbers on how quickly your BC/DRpolicy dictates you recover. At the very least, this should
I N F O W O R L D . C O M D E E P D I V E S E R I E S 4Disaster Recovery Deep Diveiinclude targets for recovery times and recovery points.However, your organization may not decide to place thesame RTO/RPO requirements on every kind of disaster.For example, your organization may require a30-minute RTO and 15-minute RPO for a particularlycritical database application in the face of hardware andsoftware failures, but may be satisfied with longer RTO/RPOs if the entire office building is destroyed by a fireand employees are unable to return to work.Of course, we’d all like to have near-zero recoverytime and data loss. While this is absolutely possible withtechnology available today, it also may be prohibitivelyexpensive. Thus, defining policy requirements is a bal-ancing act between the relative cost of providing ultra-low RTO/RPO and the cost to the business of downtimeor data loss.However, it’s important to avoid getting caught up indiscussions of how much recovery options will cost untilyou’ve determined what the organization truly needs inorder to remain viable. Too often, huge price tags willdissuade IT from even offering aggressive RTO/RPOsolutions for fear of being laughed out of the corneroffice. If in the final analysis budgets won’t support moreexpensive options, a collective decision can be made toscale back the requirements, but it will be made basedon a full understanding of the situation.BUILDING BLOCKSAfter you have a solid idea of the services you need toprotect, the risks you need to protect them from, andyour recovery objectives, you can figure out the tech-nological building blocks you’ll need to deliver on thepolicy. Due to the sheer number of different solutionsavailable, there is really no one right way to protect anygiven infrastructure.The trick to designing a good BC/DR solution is totreat each component as a layer of protection, each withdifferent capabilities that combine to provide a compre-hensive solution.SOFTWARE BUILDING BLOCKSThe heart of any BC/DR strategy is the software thatmakes it tick. Generally speaking, this will involve at leastone kind of backup software for disaster recovery, butmay also leverage application-level backup/replicationand even full-blown host-based replication software todeliver on business continuity goals.Agent-Based Backup—Traditionally, the most com-mon software component you’ll see is the agent-basedbackup utility. This software generally utilizes one ormore centralized server installations to draw backup datafrom software agents installed within the individual serv-ers you want to protect, then stores the backups on tape,direct-attached disk (DAS), or other storage hardware.If disaster strikes and data is lost, the data can berestored to the way it was at the time of the last backup.This works reasonably well for situations where you’reprotecting against data corruption or deletion and thereis a relatively long RPO (usually 24 hours or more).However, when risks include wholesale loss of a servercontaining a substantial amount of data, traditionalagent-based backup is unlikely to meet more aggressiveRTO requirements, since restoring a server from baremetal can require a significant amount of time.But don’t write off agent-based backup utilities justyet. Despite their stodgy reputation, modern backupagents may offer advanced features such as image-based backups and client side deduplication, which canboth shorten recovery objectives and decrease the timeneeded to create backup sets.Image-based backups capture the contents of anentire server at once instead of file by file and restorethe machine exactly as it was, making backups easierto capture and restore. Client-side deduplication letsyou use cheap centralized backup storage and distrib-ute the dedupe load, making backup over a WAN fasterbecause there’s less data to pass down the wire. Thesemore advanced forms of agent-based backup are usefulfor organizations who need to shorten their RTO/RPObut don’t have the virtualization stacks in place to makebackup more seamless.Virtualized Backup—Virtualized infrastructures havesignificantly more backup options available to them.Because virtualization introduces an abstraction layerbetween the operating system and the underlying serverand storage hardware, it is far easier to both obtain con-sistent image-based backups and to restore those back-ups onto different hardware configurations should theneed ever arise.
I N F O W O R L D . C O M D E E P D I V E S E R I E S 5Disaster Recovery Deep DiveiIn fact, image-based virtualization backup softwarecan also be leveraged to perform business continuityduties in addition to disaster recovery. Instead of stor-ing backups on dedicated storage media like tape ornetwork-attached storage, the backups can be storedlive on a backup virtualization cluster– a collection ofvirtualized hosts located in a secondary location thatcan be turned on at a moment’s notice. Though it’s dif-ficult to achieve RPOs of under 30 minutes with thisapproach, it can be an exceptionally cheap way to offeraggressive RTO when more advanced hardware-basedapproaches like SAN-to-SAN replication (see below)aren’t feasible.Host-Based Replication—You can achieve a similarlyaggressive RTO for physical servers by using host-basedsoftware that replicates data from your physical produc-tion servers to dedicated recovery servers, which can beeither physical or virtual. However, that solution requiresan expensive one-to-one ratio between your protectedassets and recovery assets—you’d essentially be payingfor two of everything. This kind of software is also typi-cally not dual-use, which means you’d still need to fieldseparate disaster recovery software to provide archivalbackups. However, in situations where operating systemsand applications can be virtualized and have not been,it is often possible to utilize host-based software to rep-licate many physical servers onto a single virtualizationhost— handily avoiding the need for a duplication ofphysical resources.Application-Level Backup—Certain types of appli-cations come with their own tools for providing disas-ter recovery and business continuity. For example, youcould configure a database engine to capture full back-ups and snapshots of transaction logs each night to asecondary storage device for DR, while replicating trans-action logs to a server in a secondary data center forBC. Though these solutions can provide a cheap way toprovide very stringent RTO/RPO, they are also one-offsolutions that will only apply to applications that supportthem. If you’ve got more than one critical database withits own DR and BC processes, you’d have to manageeach of them separately, increasing the complexity ofyour BC/DR approach and creating a potentially hugesource of headaches.Keep it Simple!—No matter what mix of software youchoose, try to keep it as simple as possible. Though youcan achieve a great deal by carefully layering several dif-ferent kinds of backup software and adding a healthydose of customized scripting, the more complexity youintroduce into the equation, the more likely you are toencounter backup failures or—worse—put your data ateven greater risk.HARDWARE BUILDING BLOCKSNo matter what kind of approach you use, hardwarewill play an essential role. With disaster recovery thehardware’s job is to store backup data and provide dataretention/archival capabilities. In BC applications, therole of hardware can much more varied—ranging fromvirtualization hosts with their own inboard storage, toshared network attached storage (NAS) systems, andsynchronized storage area networks (SANs) hundredsof miles apart. The public cloud can also be leveragedto play a substantial role in both DR and BC.Tape Backup—When most people think of DR-relatedbackups, the first thing that comes to mind is usually tapebackup. Though much maligned, tape backup systemshave remained relevant due to sequential read and writeperformance that can be two to three times faster thandisk, as well as well as their long shelf life and the relativedurability. Though disk-based backup has become muchmore popular recently, disaster recovery approaches thatrequire backup media to be shipped off site for archivalstorage still generally rely on tape.Disk-Based Backup—However, tape suffers from anumber of substantial limitations, primarily because it’sa sequential medium. Reading random chunks of dataoff different parts of the tape can be very time consum-ing. Backups that rely heavily on updating or referenc-ing previous backup file data are much better suitedto on-line spinning disk. But even where disk is beingutilized as the first backup tier, off-site archives are stillfrequently stored on tape—forming the oft-used disk-to-disk-to-tape model.Disk backup media can take a number of differentforms. In the simplest application, it can be a backupserver with a large amount of direct-attached storage. Insituations where the backup software is running within a
I N F O W O R L D . C O M D E E P D I V E S E R I E S 6Disaster Recovery Deep Diveivirtualization environment or is on a system that storesbackup sets elsewhere, NAS is frequently used.NAS and VTL—In large installations requiring signi-ficant amounts of throughput and long periods of dataretention, it’s much more common to see dedicated,purpose-built NAS backup appliances. These devicescome in two basic flavors: one that acts as an arbi-trarily large NAS and is accessible via a variety of file-sharing protocols; and virtual tape libraries (VTLs) thatemulate iSCSI or Fibre-Channel-based tape libraries.VTLs can be useful in situations where legacy backupsoftware expects to have a real tape drive attached,but tapes themselves aren’t desirable. By using a VTL,you can provide a deduplicated disk-based backupoption while still allowing the legacy backup softwareto function.For most organizations that need to store largeamounts of data for a significant period of time, however,a NAS backup appliance is much more likely to be theright answer. The primary reason for this is that nearlyevery enterprise-grade backup appliance will implementsome kind of deduplication to remove portions of thedata set that are exact copies of each other—like, forexample, the same company-wide email sent to everyemployee’s in-box. Deduping the data can significantlyreduce the amount of storage required by each backupset, allowing organizations to increase the amount ofhistorical backup data they can maintain.There are a wide variety of deduplication products,some of which are more efficient than others, but theygenerally utilize one of two approaches: at-rest dedupli-cation and inline deduplication. In the first instance, datais backed up onto the appliance as-is, and then dedupli-cated after the backup is complete. In the second, datais deduplicated as it is being written onto the device.Typically, devices that leverage at-rest deduplicationtake less time to backup and restore data because theydon’t have to dedupe and/or rehydrate data as it moveson and off the device. However, because they muststore at least one back up in its un-deduplicated state,they also require more raw storage compared to inlinededuplication devices.The type of deduplication device that’s best for agiven application really depends upon how it will beused. For instance, if you plan to store backups of avirtualized environment on an NAS, you may want touse an at-rest deduplication platform due to its typicallyfaster read and write performance. Some virtualizationbackup software can restore a virtual machine while thatvirtual machine’s backup data resides on the NAS—yield-ing extremely short RTO.In this kind of instant restore model, a virtualizationhost can mount the backup image directly from theNAS, start the VM, and the live-migrate it back to theoriginal primary storage platform—essentially eliminatingthe time it would normally take to restore the data backto primary storage in a traditional backup. In that sense,what is normally considered a DR resource can be lever-aged to supply some level of BC functionality—especiallyif the NAS is located at a remote data center along withdedicated virtualization capacity.SAN Replication— However, most BC approachesrely on a dedicated warm site—a data center housed ina remote location that contains all of the infrastructurenecessary to keep the organization running if the pri-mary data center goes down. That site is almost alwayscoupled with some type of storage replication, usuallya secondary SAN that’s replicating data from the pro-duction SAN.Because it essentially doubles your organization’sinvestment in storage hardware and requires largeamounts of low-latency bandwidth to connect both sites,this type of replication can be very expensive, but it iswidely recognized as the most comprehensive businesscontinuity approach available.There are two kinds of SAN-to-SAN replication: syn-chronous and asynchronous. With synchronous replica-tion, data is kept in precise synchronization betweenthe production SAN and recovery SAN. By contrast,asynchronous replication captures snapshots of yourdata at specified intervals; if the primary storage systemfails, you can roll back to a recent snapshot to minimizethe data loss.At first glance, it would appear that synchronous rep-lication is better. Because data on the recovery SANis always identical to that on the production SAN, therecovery time is exactly zero. The problem? If that pro-duction data becomes corrupted, the corrupted datawill be replicated on the recovery SAN as well. (That’swhy disaster recovery is an essential component of even
I N F O W O R L D . C O M D E E P D I V E S E R I E S 7Disaster Recovery Deep Diveithe most robust business continuity system—you alwayswant to have a reliable backup set.) With asynchronousreplication, you can roll back to the most recent snap-shot of data preceding the corruption event.If you have stringent RTO requirements, you prob-ably need synchronous replication to meet them. Butyou’ll also need to make sure your DR resources arealigned correctly in case you need them. Dependingon the storage platform you choose you may be able toadopt a hybrid approach, taking snapshots of the SANproduction data and storing them on the recovery sideSAN—effectively providing you with zero RPO as wellas the ability to roll back to an earlier data set withouthaving to restore from backups.Cloud-based DR and BC—In scenarios where a largeamount of Internet bandwidth is available, the publiccloud can also provide a viable option for both DR andBC goals. In DR applications, cloud-based backup canprovide a replacement for tape media or NAS repli-cation in order to provide off-site backup protection.However, managing the security of backup data beingplaced in the cloud and ensuring that the backup datacan be quickly and easily brought back on site can bea challenge.The public cloud can also fill the role of a fully deployedhot recovery data center. Through the use of application-specific backup software or host-based replication tools,entire server infrastructures can be protected by storageand computing resources located in the cloud.Whether you decide to opt for cloud-based DR orBC depends on a variety of factors, including the type ofIT resources your organization has, whether you havethe skill set in house to navigate the fundamentally dif-ferent nature of cloud services, how much bandwidthyour data requires, and how closely you must adhere togovernment regulations regarding information security.TESTING AND COMPLIANCENo matter what solution you choose for disaster recov-ery or business continuity, building a consistent test-ing and compliance regimen—and sticking to it—is vital.Without frequent testing, you can never be certainthe data you thought was backed up can actually berestored, or that your secondary data center will comeon line when the Big One hits.This is where many organizations fail. Overworkedand understaffed IT orgs may feel they lack the time andresources required for frequent testing of DR and BC.But that may also be due to poorly designed processesor a lack of sufficient automation. Making an extra effortto streamline these processes will help not only in testingthese solutions but also in implementing them when areal disaster strikes.In the ideal scenario, organizations will deploy theirDR or BC solutions as part of their day-to-day opera-tions—allowing them to both leverage their investmentsand test them at the same time. For example, you mightuse the DR system for a mission-critical database to pop-ulate a read-only database platform that’s used for gener-ating reports. If a problem arises with a report, that couldbe a sign your DR system isn’t working as it should.Or you might routinely transition workloads betweena production data center and a recovery data center—spreading the load across both data centers when yourneed more performance while also demonstrating thatyour fail-over methodology is functioning properly.Backups can and do fail silently. If you do not testthem, you won’t find out until it’s too late—a situationnobody wants to find themselves in. You’ll also have anexcellent opportunity to evaluate whether you’re meet-ing your organization’s written RTO/RPO requirements,and to lobby for changes if you’re not.BC/DR: A BUSINESS NECESSITYAs businesses become increasingly dependent on datato survive, the importance of disaster recovery and busi-ness continuity measures cannot be overstated. Onlythrough thoughtful and collaborative planning thatinvolves all levels of the organization—both technicaland non-technical—can you field a comprehensive disas-ter recovery and business continuity strategy that appro-priately sets expectations and adequately protects thelifeblood of the business. i