
High Availability in Storage Systems


Published in: Technology, Business
Introduction

In today's connected world, information and communication have become vital and fundamental aspects of every sphere. Be it an individual or a business, data has become the lifeblood of our daily existence. Large-scale panics during Twitter blackouts are proof of this fact of life today. For businesses, even brief downtime can result in substantial losses, and long-term downtime resulting from human and natural disasters can cripple a business and bring it to its knees. According to Dun & Bradstreet, 59% of Fortune 500 companies experience a minimum of 1.6 hours of downtime per week, which translates into $46 million per year. According to Network Computing, the Meta Group and Contingency Planning Research, the typical hourly cost of downtime varies from roughly $90,000 for media firms to about $6.5 million for brokerage services. Thus, it is clear that depending on the nature and size of the business, the financial impact of downtime can vary from one end of the spectrum to the other. Often, the impact of downtime cannot be predicted accurately. While there are some obvious impacts in terms of lost revenue and productivity, there can also be intangible impacts, such as damage to brand image, that have not-so-obvious and far-reaching effects on the business.

Disaster Recovery is not High Availability

Disaster Recovery (DR) has become the buzzword of every enterprise today. In today's volatile and uncertain world, it is extremely important to plan for contingencies that protect against possible disasters. Disasters can be software related or hardware related. Software disasters can result from viruses and other security threats such as hacking, or from deletion of data, whether accidental or malicious.
Hardware-related disasters can result from the failure of components such as drives, motherboards and power supplies, or from natural and man-made site disasters such as fire and flooding. Different disasters need different recovery strategies: software failures can be protected against using techniques such as snapshots and Continuous Data Protection (CDP), while hardware failures can be recovered from by building component redundancies into the system (such as RAID, RPS), by backing data up on alternate media using D2D and D2D2T backup methodologies, and by using synchronous and asynchronous replication strategies.

Thus, disaster recovery is one aspect of the Business Continuity (BC) strategies that a company must employ to minimize the impact of downtime. Disaster recovery is often measured in terms of two Service Level Agreement (SLA) objectives: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). RPO represents the acceptable amount of data loss on a disaster, measured in time. RTO represents the amount of time within which the business must be restored after a disaster.

While a DR strategy focuses on the effectiveness of the recovery process after a disaster, it does not focus on keeping the data available without any downtime. Availability is measured as the ratio of mean
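To make RPO and RTO concrete, the following sketch checks a recovery event against SLA targets using simple time arithmetic. The 4-hour RPO, 1-hour RTO and the timestamps are hypothetical values chosen for illustration, not figures from the text.

```python
from datetime import datetime, timedelta

# Hypothetical SLA targets: lose at most 4 hours of data (RPO),
# restore service within 1 hour of the disaster (RTO).
RPO = timedelta(hours=4)
RTO = timedelta(hours=1)

def meets_sla(last_recovery_point, failure_time, service_restored):
    """True if both the worst-case data loss and the outage meet the SLA."""
    data_loss = failure_time - last_recovery_point  # data since last backup is lost
    outage = service_restored - failure_time        # time the service was down
    return data_loss <= RPO and outage <= RTO

failure = datetime(2011, 5, 10, 12, 0)
print(meets_sla(failure - timedelta(hours=3), failure,
                failure + timedelta(minutes=45)))   # True: within both targets
print(meets_sla(failure - timedelta(hours=6), failure,
                failure + timedelta(minutes=45)))   # False: 6h of data lost > RPO
```

Note that tightening the RPO drives the backup or replication schedule (how often a recovery point is created), while tightening the RTO drives the recovery mechanism itself.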
time between failures (MTBF) to the sum of MTBF and mean time to repair (MTTR). Thus, availability indicates the percentage of time the system is available throughout its useful life. As mentioned earlier, one of the primary goals of disaster recovery strategies is to minimize the RTO (downtime). Since MTTR is a measure of the downtime and must meet the objectives of RTO, a comprehensive disaster recovery strategy must also encompass strategies to increase availability. Thus, while DR strategies are not strictly availability strategies, they do meet availability requirements to an extent.

Figure 1: System Availability

Classes of Availability

Availability is often expressed as a percentage of system uptime. An availability of about 90-95% is sufficient for most applications. However, for extremely critical business data, such levels of availability are simply not enough. As mentioned before, for brokerage services and businesses offering online services, a downtime of more than a few minutes a year can have significant impacts on operations. For example, 99.9% availability typically means about 9 hours of downtime per year. The financial and other impacts of such downtime could spell trouble for the business. Truly highly available solutions have an availability of 99.999% ("five nines") or 99.9999% ("six nines"); such solutions have a downtime of the order of a few seconds to a couple of minutes per year.

There are different classes of data protection mechanisms based on availability. Figure 2 shows a pyramid of various data protection strategies. As one goes up the hierarchy, the downtime decreases and hence the availability increases. The top two levels of the pyramid constitute strategies that represent true high availability ("five nines" and "six nines").

Figure 2: Classes of Data Protection
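The availability formula and the "nines" figures above can be reproduced with a few lines of arithmetic. The sketch below is illustrative; the MTBF/MTTR values passed in are assumptions, not measurements.

```python
# Availability = MTBF / (MTBF + MTTR), and the annual downtime it implies.

SECONDS_PER_YEAR = 365 * 24 * 3600

def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is up over its useful life."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def annual_downtime_seconds(avail):
    """Expected downtime per year for a given availability ratio."""
    return (1 - avail) * SECONDS_PER_YEAR

# e.g. an assumed MTBF of 999 hours with a 1-hour MTTR gives "three nines"
print(availability(999.0, 1.0))                          # 0.999

# "Three nines": roughly 8.76 hours of downtime per year
print(round(annual_downtime_seconds(0.999) / 3600, 2))   # 8.76

# "Five nines": roughly 5.3 minutes per year
print(round(annual_downtime_seconds(0.99999) / 60, 1))   # 5.3

# "Six nines": roughly 31.5 seconds per year
print(round(annual_downtime_seconds(0.999999), 1))       # 31.5
```

This also shows why the article ties DR and availability together: for a fixed MTBF, the only way to add a "nine" is to cut MTTR, i.e. to recover faster.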
Active/Active Dual Controllers: SBB

The fundamental way to make a storage system highly available is to make each and every component of the system redundant. This includes the processors, memory modules, network and other host connectivity ports, power supplies, fans and other components. Apart from these, the drives are also configured in RAID to ensure tolerance of drive failures. However, the disk array controller (RAID controller) and the motherboard of the system still constitute single points of failure.

Storage Bridge Bay (SBB) is a specification created by a non-profit working group that defines a mechanical/electrical interface between a passive backplane drive array and the electronics packages that give the array its "personality", thereby standardizing storage controller slots. One chassis can host multiple controllers that can be hot-swapped. This ability to have multiple controllers means that the system is protected against controller failures as well, giving it true high availability.

Figure 3: Storage Bridge Bay

But such a configuration is not without challenges. One of the primary challenges is that the system hosts two intelligent controllers within the same unit that share a common mid-plane and drive array. Since the drive array is shared, the two controllers must exercise a mutual exclusion policy on the drives to ensure that they do not modify the same data simultaneously, causing data corruption and inconsistencies. Thus, the RAID module on the controllers must be cluster-aware to avoid such collisions and handle the resulting conflicts. Further, the two controllers each have their own cache of the metadata and data stored on these drives. These two caches need to be kept synchronized to ensure that one controller can resume the activities of the other upon its failure.
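As a rough illustration of the mutual exclusion policy described above, the sketch below uses a simple per-volume ownership table: a controller may write to a volume only while it holds ownership. The `SharedArray` class and its API are hypothetical stand-ins for a cluster-aware RAID layer, not a real controller interface.

```python
import threading

class SharedArray:
    """Toy model of a shared drive array arbitrated between two controllers."""

    def __init__(self):
        self._owners = {}             # volume id -> owning controller id
        self._lock = threading.Lock()

    def acquire(self, volume, controller):
        """Grant write ownership only if the volume is free or already ours."""
        with self._lock:
            owner = self._owners.setdefault(volume, controller)
            return owner == controller

    def release(self, volume, controller):
        """Give up ownership so the peer controller can take over."""
        with self._lock:
            if self._owners.get(volume) == controller:
                del self._owners[volume]

array = SharedArray()
print(array.acquire("vol0", "ctrl-A"))  # True: controller A now owns vol0
print(array.acquire("vol0", "ctrl-B"))  # False: B must wait, or take over on A's failure
array.release("vol0", "ctrl-A")
print(array.acquire("vol0", "ctrl-B"))  # True: ownership passed to B
```

In a real SBB system this arbitration happens over the dedicated inter-controller channels and the shared mid-plane, which is exactly why losing that communication path is dangerous, as discussed next.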
Figure 4: Dual Redundant Controller Unit

In order to perform this clustered RAID communication and maintain cache coherency, the two controllers need a set of (preferably) dedicated communication channels. A combination of more than one communication channel, such as SAS fabric and Ethernet connections, can be employed here to ensure minimal performance impact and redundancy in this communication layer as well. As with all dual redundant "intelligent" clusters, the loss of inter-node communication can result in the two controllers losing cache coherency. Further, once communication is lost, each controller may try to take over the operation of its peer, resulting in a split brain scenario. In order to handle this split
brain scenario, the two controllers also need to maintain a quorum, using dedicated areas of the shared drive array, to avoid conflicts and data corruption.

The key advantage of such a dual controller setup is that it is almost fully redundant, with hot-swappable components. However, despite the controllers being redundant, the mid-plane connecting the controllers to the drive back-plane is still shared, making it a single point of failure.

High Availability Cluster

The highest class of availability in the availability pyramid is achieved using High Availability (HA) Clusters. As mentioned before, while dual controllers are extremely robust and fault tolerant, they are still susceptible to mid-plane failures. Furthermore, since the drive array is shared between the two controllers, RAID is the only form of protection available against drive failures. HA Clusters are clusters of storage nodes built with fully redundant nodes, which ensures continuity of data availability despite the failure of a component or even an entire storage node. This represents the highest form of availability possible ("six nines"). In comparison to SBB-based dual controller nodes, HA Clusters do not suffer from any single point of failure. In addition, since the drive arrays are not shared between the two systems, each system has its own RAID configuration, making the HA Cluster resilient to more drive failures than an SBB setup. Unstable data center environments, such as those subject to rack disturbances, are common causes of premature drive failures, and dual controller nodes are more prone to system failures than HA Clusters in such environments. And finally, HA Clusters (with DSM) are also resilient to site failures, making them the best-in-class availability solution.

Figure 5: HA Cluster
Figure 6: DSM based HA Cluster

However, SBB-based dual controller units have a lower disk count, making them more power-efficient and a greener solution with a smaller data center footprint. HA Clusters also encounter the split brain syndrome associated with dual controller nodes. However, unlike with dual controller nodes, this
problem cannot be addressed using a quorum disk, as the two units do not share a drive array. One way to address this problem is to have a client-side device specific module (DSM) that performs the quorum action on a split brain. The DSM sits in the path of the IO and decides on the path along which to send each IO. In addition, it keeps track of the status of the HA Cluster, i.e. whether the two nodes are synchronized, and permits a failover action from one system to the other only when the two nodes are completely synchronized.

The drawback of having a client-side DSM is that the HA Cluster becomes dependent on the client. Also, if the clients are themselves clustered, then each client needs to have a distributed DSM that communicates with the others.

Figure 7: Client agnostic HA Configuration

A client-agnostic HA Cluster can be created if we understand the causes of a split brain scenario in an HA Cluster. Typically, a split brain scenario that causes data corruption occurs when the network path between the storage nodes has failed, severing communication between the two nodes, while the client can still access both nodes. In this scenario, both storage nodes will try to take over the cluster ownership, and unless the client has some way of knowing the right owner of the cluster, IOs could potentially be sent to the wrong storage node, causing data corruption. Thus, an HA Cluster in which the storage nodes have lost contact with each other while the connections from the client to both storage nodes remain alive is the cause of the split brain scenario. As we can see in Figure 6, such a setup is not a true high availability solution, as it does not provide path failover capability. Thus, it can be seen that true HA setups are not prone to split brain syndrome.
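The DSM behaviour described above, routing IO to the active node and permitting failover only when the nodes are synchronized, can be sketched as follows. The `DSM` class, its API, and the node states are hypothetical, chosen only to illustrate the decision logic.

```python
class DSM:
    """Toy model of a client-side device specific module arbitrating an HA pair."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.nodes_in_sync = True   # updated by monitoring the HA Cluster state

    def select_path(self, primary_alive, secondary_alive):
        """Pick the node to receive IO; act as the quorum on a split brain."""
        if primary_alive:
            return self.primary      # normal case: IO goes to the cluster owner
        if secondary_alive and self.nodes_in_sync:
            return self.secondary    # failover only if fully synchronized
        return None                  # otherwise fail the IO rather than risk
                                     # writing to a stale, divergent node

dsm = DSM("node-A", "node-B")
print(dsm.select_path(True, True))    # node-A
print(dsm.select_path(False, True))   # node-B: failover permitted, nodes in sync
dsm.nodes_in_sync = False
print(dsm.select_path(False, True))   # None: failover blocked until resync
```

The key point the sketch captures is that only one node ever receives IO, so even if both storage nodes believe they own the cluster, the client-side arbiter prevents divergent writes.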
Figure 7 shows one such network configuration that supports client-agnostic HA Cluster configurations.

True High Availability

A truly highly available data center is one in which every component in the system, not just the storage, is highly available. This includes the client machines; application servers such as databases, mail servers, etc.; network switches; the network paths from clients to the application servers; and the network paths from application servers to the storage systems. Figure 8 shows a true HA setup in which every component has redundancy built into it. Thus, the failure of any single component, say a switch or a network path, does not make the system unavailable.
Figure 8: True High Availability

Summary

Thus, it can be seen that true storage high availability can only be ensured when there is redundancy built into every component of a storage sub-system. The dual redundant controller setup and the HA Cluster setup are two such setups that deliver best-in-class availability. Each has its own advantages and drawbacks, but a combination of these approaches, in addition to application server and network path redundancies, delivers a truly highly available data center.