Understanding High Availability - Introducing the Theory and Concepts of High Availability


This 10-page document is intended for technology managers and those working in a current or proposed highly available environment. It provides a starting point in the development of a High Availability architecture by defining terms, detailing concepts and explaining a methodology.

High Availability and Disaster Recovery are both defined and contrasted with one another, and an approach is proposed to determine component dependencies and analyse internal system functions.

Influences affecting enterprise IT systems in general are explored along with the requirements and assumptions inherent in any High Availability architecture.

Published in: Technology, Business

  1. Understanding High Availability
     Introducing the Theory and Concepts of High Availability

     Version Number: 1.04
     Status: Final
     Author: Paul Moore, HA Infrastructure Architect
     Date Published: 21 March 2012 (V1.0), 2 September 2012
     File Name: Understanding High Availability v1.04.docx
     Copyright: © 2012 Paul Moore, Astute Systems
     License: Creative Commons Attribution 3.0 License
  2. Acknowledgements:

     Name                                    Contributions
     Brenton Carbins, Socius in Veritas      Review, Footnote 4, The Resilient Enterprise
     Debbie Moore, The Picky Proofreader     Review, Proofreading
     iStockPhoto                             Cover Photo

     Reviewers:

     Role                       Name               Review Date
     Infrastructure Architect   Paul Moore         21-Mar-2012
     Infrastructure Architect   Brenton Carbins    16-Mar-2012

     Reference Documents:

     Title                      Author                                Version          Location
     The Resilient Enterprise   Richard Barker, Veritas (Symantec)    Published 2002   Commercially Available

     Contents

     1 Introduction
     2 Definition
     3 Costs and Benefits of Availability
     4 Prediction
     5 Sufficient Understanding
     6 A Systems Approach
     7 High Availability Calculation
     8 Determining Dependencies
     9 Architectural Requirements
     10 Logical Requirements
     11 High Availability Assumptions
     12 Architectural Decisions
     13 Glossary

     Figures

     Figure 1: A Conceptual Graph of Availability versus Cost
     Figure 2: An example of a Black Box system
     Figure 3: An example of Black Box system recursion
     Figure 4: System Availability Calculation
     Figure 5: Sub-systems in an IT System

     © 2012 Paul Moore, Astute Systems. Published under the License.
  3. 1 Introduction

     To develop any high availability infrastructure it is essential to first understand what high availability is and is not. This document attempts to communicate High Availability concepts in a concise and efficient manner.

     2 Definition

     (Margin note: HA focuses on minimising the chance of a service failure.)

     Disaster Recovery and High Availability are related, yet different, concepts. They can be summarised as follows:

     • High Availability is an approach to minimise the probability of a failure to provide an operational service. High Availability is the automatic continuation or resumption of service after a predictable interruption. Example: Disk mirroring continues to provide data in the event of a disk failure. (But it does not guarantee that the highly available data is uncorrupted.)

     • Disaster Recovery is an approach to restoring operational service after a failure to provide it due to a predictable or non-predictable event. Disaster Recovery is a system enabling the recovery of services after an interruption due to events not mitigated by an HA system, or due to the failure of an HA system. Example: Backups enable recovery from a service failure due to a data loss.

     3 Costs and Benefits of Availability

     (Margin note: Availability increases, but complexity and cost increase far faster.)

     As the level of service availability increases, the cost of providing it rises by orders of magnitude, owing to increasing architectural complexity and resource use, until no further availability can be provided with currently known technology. As a result, an appropriate balance must be achieved between the costs of implementing availability and the costs of non-availability.

     [Figure 1: A Conceptual Graph of Availability versus Cost. Unit $ cost, on a logarithmic axis from 0.0001 to 10000, plotted against availability from 90.000% to 99.998%.]
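To make the availability axis of Figure 1 more concrete, the following minimal sketch converts an availability percentage into the downtime it permits per year. The function name `allowed_downtime_hours` and the 365-day year are assumptions made for illustration; they do not come from the original document.

```python
# Hypothetical helper (not from the original document): converts an
# availability percentage into the downtime it permits over a period.
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored for simplicity


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime permitted at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_hours * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Each step towards 100% shrinks the permitted downtime sharply.
    for level in (90.0, 99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(level)
        print(f"{level:7.3f}% -> {hours:9.3f} hours of downtime per year")
```

Each additional "nine" leaves roughly a tenth of the previous downtime budget, which is one way to see why the cost of the remaining improvements climbs so steeply.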
  4. 4 Prediction

     (Margin note: A prediction of the statistical likelihood of future events …)

     The architecture of a high availability service requires an assessment and prediction of the most likely and frequent causes of potential service interruption, and a resultant design to enable the service to continue operating when the predicted event occurs.

     (Margin note: … but availability is a historic measure …)

     These assessments and predictions will invariably differ from the actual occurrence of events observed during future service operation. As a result, the actual performance can never be guaranteed through the use of any particular architecture, design or implementation. The actual future availability of the service will, by definition, be a historic statistical measurement over a set period of time.

     (Margin note: … so there are no guarantees.)

     A high availability architecture seeks to provide a higher functional service level by designing systems capable of withstanding a range of conceivable failure scenarios. However, a perfect service will never be possible due to limitations imposed by hardware, software, communications, policies, cost and the inherently limited ability to predict the likelihood of future events and their consequences.

     5 Sufficient Understanding

     (Margin note: The devil is always in the detail …)

     To design a highly available system, a thorough understanding of its components is required, to the degree that all significant availability risks to the system are understood and managed.

     British writer and scientist Arthur C. Clarke stated in his third law of prediction: “Any sufficiently advanced technology is indistinguishable from magic.” [1]

     (Margin note: … so eliminate all “magic”.)

     Adopting the above terminology, all magic must be eliminated from the system through enquiry and investigation.

     Several tools can assist in gaining this understanding. Where a system contains complexity and where there is a logical layering of component sub-systems, a systems approach is one of the most useful. This approach is outlined in the following section.

     [1] See “Magic” in the Glossary.
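Because availability is described above as a historic statistical measurement over a set period, a small sketch of that measurement may help. The function name `measured_availability` and the outage-record format are assumptions for illustration only, and the outages are assumed not to overlap.

```python
from datetime import datetime, timedelta


def measured_availability(period_start, period_end, outages):
    """Return the percentage of the period during which the service was up.

    `outages` is a list of (start, end) datetime pairs. They are assumed
    not to overlap one another, and each is clipped to the period.
    """
    period = period_end - period_start
    if period <= timedelta(0):
        raise ValueError("period_end must be after period_start")
    downtime = timedelta(0)
    for start, end in outages:
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            downtime += clipped_end - clipped_start
    # Dividing one timedelta by another yields a plain float ratio.
    return 100.0 * (1.0 - downtime / period)


if __name__ == "__main__":
    start = datetime(2012, 1, 1)
    end = start + timedelta(hours=100)
    one_hour_outage = [(start + timedelta(hours=10), start + timedelta(hours=11))]
    print(measured_availability(start, end, one_hour_outage))
```

The figure produced this way describes only the period that was measured; it says nothing certain about the next period, which is the point the section makes about guarantees.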
  5. 6 A Systems Approach

     (Margin note: A ‘black box’ model.)

     In determining a system's level of availability it can be useful to implement a black-box [2] approach. This maximises flexibility by enabling arbitrary boundaries to be drawn to best suit any particular scenario, enforces a rigorous and disciplined focus on the functional requirements of the system and eliminates consideration of unnecessary details which might otherwise complicate the assessment.

     This systems approach, and the types of information necessary to use it, can be best demonstrated using a simple example. The example system takes a two-dimensional shape of a particular colour as an input, changes any blue to green, changes any green to red and changes any red to blue, duplicates the shape, vertically flips one of the shapes around its centre of gravity and sends the result to the output.

     [Figure 2: An example of a Black Box system. The 2D Shape Transformer System black box takes an input with the properties 2D Shape and Colour, applies its function (RGB colour B→G, G→R, R→B; duplicate shape; vertical flip one shape) and produces an output with the properties 2D Shape and Colour.]

     How the system implements its internal functions is unknown and need not be known because all behaviour is fully defined. Consequently the black box can be used without internal investigation to ease analysis.

     (Margin note: When must the ‘black box’ be opened?)

     Investigation of the internal working of the system is required in a number of circumstances, including:

     • when the system input, output or function is not fully known,
     • when the system behaviour must be validated,
     • when the system must be assessed for potential failure vulnerabilities,

     … with the latter being most important when determining or validating system availability.

     An investigation can be performed by breaking the original black box system into its various functional components, with each of these in turn being considered as an individual black box sub-system as shown in the diagram below.

     [Figure 3: An example of Black Box system recursion. The 2D Shape Transformer System is broken into four black box sub-systems, each with its own input and output: the Colour Translator System, the Object Cloning System, the Vertical Flipping System and the Shape Merge System. The overall input and output properties (2D Shape, Colour) and function (RGB colour B→G, G→R, R→B; duplicate shape; vertical flip one shape) are unchanged.]

     In the event that an investigation of one or more of the individual sub-systems is required, an additional level of recursion [3] can be performed on each of them by using the same criteria and method as used for the system as a whole.

     [2] See “Black Box” in the Glossary.
     [3] See “Recursion” in the Glossary.
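The decomposition of the example system into black box sub-systems can be sketched as four small functions composed into one. This is only an illustration under stated assumptions: the `Shape` type, all function names, the order in which the sub-systems are chained, and the use of the mean vertex height as a stand-in for the centre of gravity are hypothetical and are not taken from the original figures.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Shape:
    """Hypothetical input/output type: a 2D shape with a colour."""
    vertices: Tuple[Point, ...]
    colour: str  # "red", "green" or "blue"


# Colour rule from the example: blue -> green, green -> red, red -> blue.
COLOUR_MAP = {"blue": "green", "green": "red", "red": "blue"}


def translate_colour(shape: Shape) -> Shape:
    """Colour Translator sub-system."""
    return Shape(shape.vertices, COLOUR_MAP[shape.colour])


def clone(shape: Shape) -> Tuple[Shape, Shape]:
    """Object Cloning sub-system: returns the shape and a duplicate."""
    return shape, shape


def flip_vertically(shape: Shape) -> Shape:
    """Vertical Flipping sub-system.

    Assumption: the mean of the vertex y-coordinates stands in for the
    centre of gravity, which is a simplification.
    """
    centre_y = sum(y for _, y in shape.vertices) / len(shape.vertices)
    flipped = tuple((x, 2 * centre_y - y) for x, y in shape.vertices)
    return Shape(flipped, shape.colour)


def merge(first: Shape, second: Shape) -> List[Shape]:
    """Shape Merge sub-system: combines both shapes into one output."""
    return [first, second]


def transform(shape: Shape) -> List[Shape]:
    """The outer black box, built only from the sub-system interfaces."""
    recoloured = translate_colour(shape)
    original, duplicate = clone(recoloured)
    return merge(original, flip_vertically(duplicate))
```

A caller of `transform` needs to know only its input and output, while each inner function can be inspected, tested or replaced on its own, which mirrors the recursion described above.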
  6. 7 High Availability Calculation

     Having established the boundaries of various sub-systems using the systems approach outlined in the previous section, it is now desirable to determine the availability properties of the larger system.

     (Margin note: Required sub-systems decrease availability.)

     The availability calculation for a system relies on a statistical treatment of the likelihood of failures in sub-systems and an assessment of their direct and indirect consequences. Where any single sub-system is required for system operation, the availability of the system cannot be higher than the availability of that sub-system.

     (Margin note: Redundant sub-systems increase availability.)

     Conversely, where any single sub-system has other redundant systems that allow it to fail without causing the system to fail, the availability of the system is higher than it would be in the case where the sub-system was non-redundant. These observations and the associated availability equations can be seen in the diagram below.

     “The availability of a system is the product of the availability of every serial sub-system upon which that system depends, multiplied by the availability derived from the product of the unavailability of each member of a group of redundant parallel sub-systems where the system depends on the availability of that group.”

     [Figure 4: System Availability Calculation. The system consists of Sub-System 1 (Serial), Sub-System 2 (Serial), Sub-Systems 3 and 4 (Parallel), Sub-System 5 (Serial) and Sub-Systems 6, 7 and 8 (Parallel). Sub-System 2 contains SS2 Components 1 to 4 (Parallel) and SS2 Component 5 (Serial); Sub-System 5 contains SS5 Component 1.]

     Avail_S = The availability of system “S” as a percentage of a defined time period.

     Avail_S = Avail_SS1 * Avail_SS2 * Avail_SS(3,4) * Avail_SS5 * Avail_SS(6,7,8)

     Where:
       Avail_SS(3,4) = 1 - (1 - Avail_SS3) * (1 - Avail_SS4)
       Avail_SS(6,7,8) = 1 - (1 - Avail_SS6) * (1 - Avail_SS7) * (1 - Avail_SS8)
       Avail_SS2 = Avail_SS2C5 * Avail_SS2C(1,2,3,4)

     Where:
       Avail_SS2C(1,2,3,4) = 1 - (1 - Avail_SS2C1) * (1 - Avail_SS2C2) * (1 - Avail_SS2C3) * (1 - Avail_SS2C4)

     The above diagram demonstrates the availability calculation for a system by recursively using the calculation formula for each black box sub-system.

     (Margin note: Either working or failed, and any human intervention means failed.)

     For architectural purposes and in the context of information technology, a system is considered to be in either a failed or working state, with the system in the failed state when non-routine staff intervention is required. Nevertheless, from both an external service availability and management perspective, the staff that intervene to repair the system in the event of sub-system failure could be conceived to be part of the system.

     In practice, High Availability design consists of determining optimal sub-system boundaries that make both the understanding and implementation of a system as simple as possible without compromising either the requirements or functionality.
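The Figure 4 equations translate directly into code. The sketch below is illustrative only: the helper names `serial`, `parallel` and `figure_4_availability` are hypothetical, availabilities are expressed as fractions between 0 and 1 rather than percentages, and the 0.99 figures in the usage section are invented sample values. Like the equations, it assumes that sub-system failures are independent of one another.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Availability of sub-systems that are all required: the product."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Availability of a redundant group: 1 minus the product of the
    unavailability of each member."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def figure_4_availability(ss1, ss2_components, ss2c5, ss3, ss4, ss5,
                          ss6, ss7, ss8):
    """Avail_S for the layout in Figure 4.

    `ss2_components` holds Avail_SS2C1..Avail_SS2C4 (the parallel group
    inside Sub-System 2); all other arguments are single availabilities.
    """
    # Recursion step: Sub-System 2 is itself a small system.
    avail_ss2 = serial(ss2c5, parallel(*ss2_components))
    return serial(
        ss1,
        avail_ss2,
        parallel(ss3, ss4),
        ss5,
        parallel(ss6, ss7, ss8),
    )


if __name__ == "__main__":
    # Invented sample values: every sub-system and component at 0.99.
    a = 0.99
    total = figure_4_availability(a, [a, a, a, a], a, a, a, a, a, a, a)
    print(f"System availability: {total:.6f}")
```

With every element at the same availability, the result is dominated by the three purely serial elements (Sub-System 1, SS2 Component 5 and Sub-System 5), which illustrates why required sub-systems cap the availability of the whole.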
  7. 8 Determining Dependencies

     An IT system is composed of a number of sub-systems, most of which are essential to the system function, and so should be considered as serial dependencies for high availability architecture purposes. Due to the number of unavoidable serial dependencies in the IT system, the availability of each sub-system must be maximised through the addition of redundant components within each sub-system. These sub-systems are shown in the diagram below.

     (Margin note: An IT system can fail at any layer, at any time and for many different reasons.)

     [Figure 5: Sub-systems in an IT System. A layered diagram with axes labelled Increasing Dependency and Increasing Abstraction. Labels recoverable from the diagram include the layers Application, Sub Application, Core Application, Data Storage, Operating System, Communications, Hardware, Electrical, Temperature, Mechanical and Location; the supporting functions Disaster Recovery, Monitoring, Security, Support and Testing; the influences Financial, Technical, Business, Political and Legal; and the side labels High Level Function, Capacity, Operational, Documentation, Management, Maintenance, Performance, Prevention, Regulatory, Expenditure, External, Internal, Function, Training, Contractual, Detection and Planning. Each element is marked as either Serial or Parallel.]

     (Margin note: Well designed environments need less HA work at lower levels of the stack.)

     The most dependent layers of the model drive the requirements for those layers upon which they depend. For example, if the application instance is able to use an alternate instance of a core application, the availability of a specific instance of that core application is less critical. Conversely, when the application instance cannot use an alternate instance of the core application, that application instance logically cannot have a higher availability than that of the core application instance upon which it depends, and consequently, that core application instance is critical to the operation of the application instance.

     (Margin note: Data storage is often critical due to the need for a single source of truth.)

     In many IT systems the most critical component is the data storage sub-system, since there is often a requirement for a single source of authoritative data upon which to operate. This contrasts with other sub-systems such as location, electrical, communications and hardware, which can often be made redundant to form highly available sub-systems.
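Since each serial layer is to be strengthened by adding redundant components, it can be useful to estimate how many parallel instances a layer needs to reach a target. The sketch below is a hypothetical helper, not part of the original document: the name `instances_needed`, the `max_instances` limit and the small floating-point tolerance are all assumptions, and instance failures are assumed to be independent.

```python
def instances_needed(instance_availability: float, target: float,
                     max_instances: int = 100) -> int:
    """Return the smallest number of redundant parallel instances whose
    combined availability reaches `target`.

    Both values are fractions between 0 and 1 (exclusive). Instance
    failures are assumed to be independent of one another.
    """
    if not 0.0 < instance_availability < 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be between 0 and 1")
    unavailability = 1.0 - instance_availability
    combined_unavailability = 1.0
    for count in range(1, max_instances + 1):
        combined_unavailability *= unavailability
        # Small tolerance so rounding error does not demand an extra instance.
        if 1.0 - combined_unavailability >= target - 1e-12:
            return count
    raise ValueError("target not reachable within max_instances")


if __name__ == "__main__":
    for goal in (0.99, 0.999, 0.99999):
        print(goal, instances_needed(0.99, goal))
```

The estimate applies only to layers that can actually be made redundant; a sub-system that must remain a single authoritative source, such as the data storage example above, cannot simply be multiplied this way.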
  8. 9 Architectural Requirements

     • The solution must be analysed for single points of failure, which must be addressed.
     • The solution must be analysed for dependencies, which must be addressed.
     • An estimated availability for the solution must be determined to ensure that this availability figure is consistent with the availability requirements.

     10 Logical Requirements

     (Margin note: A system requires… documentation, training, availability of configuration, configuration auditability.)

     As seen in Figure 5, the following logical requirements must be met in order to provide a system capable of meeting the business requirements.

     • The system must be sufficiently documented so as to be supportable and maintainable.
     • There must be sufficient training available for support staff to be able to maintain the system in a timely manner.
     • The availability of the system must be measurable for function and responsiveness. This will require the retention of specific metrics.
     • Configuration details of system components must be available in a timely manner so that a failure of a hardware system will not result in the loss of unique configuration information.
     • Configuration details of system components must be auditable so that erroneous administrative configuration changes can be restored in a timely manner.

     11 High Availability Assumptions

     • That a failure of a single system must be mitigated against, and that the failure of multiple systems will be considered to be a failure in the larger system.
     • That a system failure during the critical time window will require automated mitigation and that there will be insufficient time for support staff to be notified, respond, analyse and perform reliable mitigation to restore service.
     • That a failure of a system component can occur at any logical level of the IT solution and can include human mistakes.
     • That the system will scale appropriately and that there is sufficient time during any required time window for the systems to perform all required operations. (That is, the system availability is not required to exceed 100%.)

     12 Architectural Decisions

     By parallelising sub-systems, no single sub-system instance represents a single point of failure and the availability of the system as a whole is increased. Parallelising sub-systems enables the performance of most maintenance activities on individual sub-system instances without the system ceasing to function.

     Decision: Implement the parallelisation of sub-systems where possible.

     By distributing load between parallel sub-systems, the throughput of the group of parallel sub-system instances is higher than it would be for a single sub-system instance. Distributing the load between parallel sub-system instances leverages the investment in hardware and software.

     Decision: Distribute load between parallel sub-systems where possible.

     By monitoring the responsiveness of parallel sub-systems, traffic can be directed to responsive instances and away from unresponsive or failed ones. When traffic is routed centrally, effective service delivery is maximised by minimising the duration that traffic is routed to unresponsive or failed parallel sub-system instances.

     Decision: Monitor the responsiveness of parallel sub-systems where possible.

  9. By using a clustered file system for all sub-system configurations, configuration files can be more easily managed. In the event of the failure of a sub-system, the unique sub-system instance's configuration files are less likely to be lost and are more rapidly available when a new sub-system instance is deployed as part of a disaster recovery plan. The clustered configuration file system serves as a highly available single source of critical configuration data which cannot be stored in the database. The clustered configuration file system can also be used to enable rapid and automated recovery for active/standby sub-systems that maintain state information outside of the database.

     Decision: Implement a file system shared by all sub-systems for configuration management.

     By using a clustered file system for all sub-system configurations, configuration files can be more easily managed. [4] As human configuration mistakes are a common cause of IT system failure, managing configuration files in a simple, logical, centralised, auditable and consistent manner is one way to increase availability, by decreasing the chance of mistakes and decreasing the time taken to recover from them.

     Decision: Implement a version control system for all sub-system configuration files.

     As sub-systems may fail for unknown reasons, availability can be maximised by restarting failed sub-system processes on the same or on alternate machines. Cluster management software, such as Veritas Cluster Server, can automate the execution of these pre-planned mitigation decisions. The clustering software will provide a global view of the availability and status of all services running on both primary and disaster recovery sites. An administrator must be able to easily fail over sub-systems or the entire system from the primary to the disaster recovery site and back again. The use of cluster management software is most critical for non-parallelisable sub-systems upon which the entire system is dependent.

     Decision: Implement cluster management software to automatically restart failed sub-systems.

     Inter-site data replication is necessary to provide a remote copy of data for disaster recovery and high availability. This can be performed by a hardware solution or at the file system level.

     Decision: Implement inter-site data replication.

     [4] This is always a cause of contentious ‘camps’ in architectural discussions. One position is that ‘running configuration’ instances should use identical configuration files, while ‘individual configuration file’ proponents maintain that shared configuration files make upgrades more difficult. A possible compromise is the use of snapshots and/or altered mount details only during upgrade procedures.
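The decisions to distribute load and to monitor responsiveness can be sketched together as a small central router that skips unresponsive instances. This is a minimal illustration under stated assumptions: the class `ResponsiveRouter`, its methods and the instance names are hypothetical, and the `is_responsive` callback stands in for whatever real health check a deployment would use.

```python
from typing import Callable, List, Optional


class ResponsiveRouter:
    """Round-robin router that sends traffic only to responsive instances.

    `is_responsive` is a caller-supplied health check (hypothetical here);
    it receives an instance name and returns True if that instance may
    be used.
    """

    def __init__(self, instances: List[str],
                 is_responsive: Callable[[str], bool]) -> None:
        if not instances:
            raise ValueError("at least one instance is required")
        self._instances = list(instances)
        self._is_responsive = is_responsive
        self._next_index = 0

    def route(self) -> Optional[str]:
        """Return the next responsive instance, or None if all have failed."""
        count = len(self._instances)
        for offset in range(count):
            index = (self._next_index + offset) % count
            candidate = self._instances[index]
            if self._is_responsive(candidate):
                # Resume after the chosen instance so load stays distributed.
                self._next_index = (index + 1) % count
                return candidate
        return None


if __name__ == "__main__":
    # Invented status table standing in for real monitoring data.
    status = {"app-1": True, "app-2": False, "app-3": True}
    router = ResponsiveRouter(list(status), lambda name: status[name])
    print([router.route() for _ in range(4)])
```

A `None` result corresponds to the failure of the whole parallel group, the case that the assumptions above treat as a failure of the larger system.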
  10. 13 Glossary

     Term: Description

     Active-Standby, Active-Active, Hot-Warm, Hot-Hot: Hot/Active: actively processing data. Warm/Standby: processing capability on standby. Active-Standby or Hot-Warm is defined as a model where the production application instance or facility (Active or Hot) will provide operational services in a business-as-usual state while a disaster recovery application instance or facility (Standby or Warm) is available to take over service provision in the event of a failure in production.

     Black Box: In science and engineering, a black box is a device, system or object which can be viewed solely in terms of its input, output and transfer characteristics without any knowledge of its internal workings, that is, its implementation is "opaque" (black). (Wikipedia)

     Database Replication: Database replication can be used on many database management systems, usually with an active-standby relationship between the original and the copies. The active logs the updates, which then ripple through to the standby copies. The standby acknowledges that it has received the update successfully, thus allowing the sending (and potentially re-sending until successfully applied) of subsequent updates. Database replication provides a higher level of reporting than log shipping, but does not lock passive databases from user changes and so is unsuitable for failover. (Wikipedia)

     Disaster Recovery: Disaster Recovery is a system enabling the recovery of services after an interruption due to events not mitigated by a High Availability system, or due to High Availability system failure.

     High Availability: High Availability is the automatic continuation or resumption of service after a predictable interruption.

     Log Shipping: Log shipping is the process of automating the backup of a database and transaction log files on a primary database server, and then restoring them onto a standby server. Similar to Database Replication, the primary purpose of log shipping is to increase database availability by maintaining a backup server to quickly replace the primary server. Log Shipping locks the standby database from user changes and is often chosen for its low cost in human and server resources and ease of implementation. Failover between primary and standby servers is manual and limited reporting capabilities are possible. (Wikipedia)

     Magic: In the context of programming, Magic is an informal term for the use of code that handles complex tasks while hiding that complexity to present a simple interface. (Wikipedia) In computer system design, Magic is used as an informal term to describe gaps in understanding the process of interaction between one system and another.

     Oracle Streams: Oracle Streams is available on Enterprise Edition systems only and enables propagation of information within and between Oracle and other databases. Oracle announced the deprecation of Streams and now encourages usage of GoldenGate (acquired by Oracle in July 2009).

     Recursion: Recursion is the process of repeating items in a self-similar way. For instance, when the surfaces of two mirrors are exactly parallel with each other the nested images that occur are a form of infinite recursion. The term has a variety of meanings specific to a variety of disciplines ranging from linguistics to logic. The most common application of recursion is in mathematics and computer science, in which it refers to a method of defining functions in which the function being defined is applied within its own definition. Specifically this defines an infinite number of instances (function values), using a finite expression that for some instances may refer to other instances, but in such a way that no loop or infinite chain of references can occur. The term is also used more generally to describe a process of repeating objects in a self-similar way. (Wikipedia)

     The Resilient Enterprise: The Resilient Enterprise is a well-known reference book on high availability and disaster recovery published by Veritas Software (now Symantec) in 2002.

     Veritas Cluster Server: Veritas Cluster Server is high-availability cluster software for Unix, Linux and Microsoft Windows computer systems, created by Veritas Software (now part of Symantec). It provides application cluster capabilities to systems running databases, file sharing on a network, electronic commerce websites or other applications. Veritas Cluster Server is one of the few products in the industry that provides both high availability and disaster recovery across all major operating systems while supporting 40+ major application / replication technologies out of the box. Similar products include Fujitsu PRIMECLUSTER, IBM HACMP, HP Serviceguard, IBM Tivoli System Automation for Multiplatforms, Linux-HA, Microsoft Cluster Server, NEC ExpressCluster, Red Hat Cluster Suite, SteelEye LifeKeeper and Sun Cluster. (Wikipedia)