Understanding
  High Availability
Introducing the Theory and Concepts of High Availability




Version Number:       1.04


Status:               Final


Author:               Paul Moore, HA Infrastructure Architect


Date Published:       21 March 2012 (V1.0), 2 September 2012


File Name:            Understanding High Availability v1.04.docx


Copyright:            © 2012 Paul Moore, Astute Systems


License:              Creative Commons Attribution 3.0 License




Acknowledgements:
Name                                                 Contributions
Brenton Carbins, Socius in Veritas                   Review, Footnote 4, The Resilient Enterprise
Debbie Moore, The Picky Proofreader                  Review, Proofreading
iStockPhoto                                          Cover Photo



Reviewers:
Role                                             Name                     Review Date
Infrastructure Architect                         Paul Moore               21-Mar-2012
Infrastructure Architect                         Brenton Carbins          16-Mar-2012



Reference Documents:
Title                        Author                        Version           Location
The Resilient Enterprise Richard Barker, Veritas (Symantec) Published 2002 Commercially Available




Contents
1       Introduction
2       Definition
3       Costs and Benefits of Availability
4       Prediction
5       Sufficient Understanding
6       A Systems Approach
7       High Availability Calculation
8       Determining Dependencies
9       Architectural Requirements
10      Logical Requirements
11      High Availability Assumptions
12      Architectural Decisions
13      Glossary


Figures
Figure 1: A Conceptual Graph of Availability versus Cost
Figure 2: An example of a Black Box system
Figure 3: An example of Black Box system recursion
Figure 4: System Availability Calculation
Figure 5: Sub-systems in an IT System




© 2012 Paul Moore, Astute Systems                                           Published under the License


           1         Introduction
                     To develop any high availability infrastructure it is essential to first understand what high
                     availability is and is not. This document attempts to communicate High Availability concepts
                     in a concise and efficient manner.

           2         Definition
                     Disaster Recovery and High Availability are related, yet different concepts. They can be
                     summarised as follows:

HA focuses on minimising the chance of a service failure

                                   High Availability is an approach to minimise the probability of a failure to provide
                                   an operational service.

                                   High Availability is the automatic continuation or resumption of service after a
                                   predictable interruption.

                                   Example: Disk mirroring continues to provide data in the event of a disk failure.
                                            (But does not guarantee the highly available data is uncorrupted.)

                                   Disaster Recovery is an approach to restoring operational service after a failure
                                   to provide it, whether the cause was a predictable or an unpredictable event.

                                   Disaster Recovery is a system enabling the recovery of services after an
                                   interruption due to events not mitigated by an HA system, or due to the failure of
                                   an HA system.

                                   Example: Backups enable recovery from a service failure due to a data loss.


           3         Costs and Benefits of Availability
but increases complexity and cost by orders of magnitude

                     As the level of service availability increases, the cost of providing it rises
                     disproportionately, by orders of magnitude rather than linearly, because of the increasing
                     architectural complexity and resource use. The rise ends with the impossibility of providing
                     any further availability using currently known technology. As a result, an appropriate
                     balance must be achieved between the costs of implementing availability and the costs of
                     non-availability.
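
                     To make the cost of non-availability concrete, an availability percentage can be
                     converted into the downtime it permits. The following is a minimal sketch only: it assumes
                     a 365-day year, and the helper name allowed_downtime_minutes is illustrative rather than
                     taken from this document.

```python
# Convert an availability percentage into the downtime it permits per year.
# Assumption: a 365-day year. The helper name is hypothetical.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted at this availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for level in (90.0, 99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(level)
        print(f"{level:6.2f}% available: {minutes:9.1f} minutes of downtime per year")
```

                     Under these assumptions each additional "nine" cuts the permitted downtime tenfold: about
                     3.65 days per year at 99%, about 8.76 hours at 99.9% and about 52.6 minutes at 99.99%.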

                     [Figure 1 is a conceptual line graph. Vertical axis: Unit $ Cost, on a logarithmic scale
                     from 0.0001 to 10000. Horizontal axis: Availability, from 90.000% to 99.998%.]

                                           Figure 1: A Conceptual Graph of Availability versus Cost






            4         Prediction
A prediction of the statistical likelihood of future events

                      The architecture of a high availability service requires an assessment and prediction of the
                      most likely and frequent causes of potential service interruption, and a resultant design that
                      enables the service to continue operating when a predicted event occurs.

but availability is a historic measure

                      These assessments and predictions will invariably differ from the events actually observed
                      during future service operation. As a result, actual performance can never be guaranteed
                      through the use of any particular architecture, design or implementation. The actual
                      availability of the service will, by definition, be a historic statistical measurement
                      over a set period of time.

so there are no guarantees

                      A high availability architecture seeks to provide a higher functional service level by
                      designing systems capable of withstanding a range of conceivable failure scenarios. However,
                      a perfect service will never be possible due to limitations imposed by hardware, software,
                      communications, policies, cost and the inherently limited ability to predict the likelihood
                      of future events and their consequences.
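
                      Because availability is a historic measurement, it can only be computed after the fact from
                      recorded outages. The sketch below shows one way this might be done. The function name, the
                      outage records and the measurement period are all hypothetical, and outages are assumed not
                      to overlap one another.

```python
from datetime import datetime


def measured_availability(period_start, period_end, outages):
    """Return the percentage of the period during which the service was up.

    outages is a list of (start, end) datetime pairs. Each outage is clipped
    to the measurement period. Assumption: outages do not overlap.
    """
    period_seconds = (period_end - period_start).total_seconds()
    down_seconds = 0.0
    for start, end in outages:
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down_seconds += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (1.0 - down_seconds / period_seconds)


if __name__ == "__main__":
    # Hypothetical 30-day period with two recorded outages (7.2 hours in total).
    june_outages = [
        (datetime(2012, 6, 5, 2, 0), datetime(2012, 6, 5, 6, 0)),       # 4 h
        (datetime(2012, 6, 19, 13, 0), datetime(2012, 6, 19, 16, 12)),  # 3 h 12 min
    ]
    result = measured_availability(
        datetime(2012, 6, 1), datetime(2012, 7, 1), june_outages
    )
    print(f"Measured availability: {result:.3f}%")
```

                      The figure obtained describes only the period measured. It says nothing certain about the
                      next period, which is the point made above.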

            5         Sufficient Understanding
The devil is always in the detail …

                      To design a highly available system, its components must be understood thoroughly enough
                      that all significant availability risks to the system are understood and managed.

                      British writer and scientist Arthur C. Clarke stated in his third law of prediction:

                              “Any sufficiently advanced technology is indistinguishable from magic.”

… so eliminate all “magic”.

                      Adopting the above terminology, all magic¹ must be eliminated from the system through
                      enquiry and investigation.

                      Several tools can assist in gaining this understanding. Where a system contains complexity
                      and where there is a logical layering of component sub-systems, a systems approach is one
                      of the most useful. This approach is outlined in the following section.




            ¹ See “Magic” in the Glossary.





          6         A Systems Approach
A ‘black box’ model

                    In determining a system's level of availability it can be useful to adopt a black-box²
                    approach. This maximises flexibility by enabling arbitrary boundaries to be drawn to best suit
                    any particular scenario, enforces a rigorous and disciplined focus on the functional
                    requirements of the system and eliminates consideration of unnecessary details which might
                    otherwise complicate the assessment.

                    This systems approach and the types of information it requires can be best demonstrated
                    using a simple example. The example system takes a two-dimensional shape of a particular
                    colour as an input, changes any blue to green, any green to red and any red to blue,
                    duplicates the shape, vertically flips one of the two shapes around its centre of gravity and
                    sends the result to the output.

                    [Figure 2 shows the 2D Shape Transformer System as a single black box.
                       Input properties:  2D Shape, Colour
                       Function:          RGB Colour B→G, G→R, R→B; Duplicate Shape; Vertical Flip One Shape.
                       Output properties: 2D Shape, Colour]

                                              Figure 2: An example of a Black Box system

                    How the system implements its internal functions is unknown and need not be known
                    because all behaviour is fully defined. Consequently the black-box can be used without
                    internal investigation to ease analysis.
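
                    The example system can be sketched as a single function whose callers rely only on its
                    input and output contract. This is an illustration under stated assumptions, not the
                    author's implementation: a shape is represented as a tuple of vertices, the centre of
                    gravity is approximated by the mean of the vertices, the output is taken to be both shapes,
                    and the names Shape and transform are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]

# Colour mapping stated in the text: blue -> green, green -> red, red -> blue.
COLOUR_MAP = {"blue": "green", "green": "red", "red": "blue"}


@dataclass(frozen=True)
class Shape:
    vertices: Tuple[Point, ...]
    colour: str


def transform(shape: Shape) -> List[Shape]:
    """Black box: recolour, duplicate, and vertically flip one copy."""
    recoloured = Shape(shape.vertices, COLOUR_MAP[shape.colour])
    # Assumption: the centre of gravity is approximated by the vertex mean.
    centre_y = sum(y for _, y in recoloured.vertices) / len(recoloured.vertices)
    flipped_vertices = tuple((x, 2 * centre_y - y) for x, y in recoloured.vertices)
    flipped = Shape(flipped_vertices, recoloured.colour)
    return [recoloured, flipped]


if __name__ == "__main__":
    triangle = Shape(((0, 0), (2, 0), (1, 3)), "blue")
    for result in transform(triangle):
        print(result)
```

                    A caller needs to know nothing about how transform works internally, only that a blue
                    triangle goes in and two green triangles, one of them flipped, come out.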
When must the ‘black box’ be opened?

                    Investigation of the internal workings of the system is required in a number of
                    circumstances, including
                            when the system input, output or function is not fully known,
                            when the system behaviour must be validated,
                            when the system must be assessed for potential failure vulnerabilities,

                    … with the last being the most important when determining or validating system availability.
                    An investigation can be performed by breaking the original black box system into its various
                    functional components, with each of these in turn being considered as individual black box
                    sub-systems as shown in the diagram below.

                    [Figure 3 shows the 2D Shape Transformer System opened up into four black box sub-systems,
                    each with its own input and output:
                       Colour Translator System
                       Object Cloning System
                       Vertical Flipping System
                       Shape Merge System
                    The overall system is unchanged.
                       Input properties:  2D Shape, Colour
                       Function:          RGB Colour B→G, G→R, R→B; Duplicate Shape; Vertical Flip One Shape.
                       Output properties: 2D Shape, Colour]

                                        Figure 3: An example of Black Box system recursion

                    In the event that an investigation of one or more of the individual sub-systems is required,
                    an additional level of recursion³ can be performed on each of them by using the same criteria
                    and method as used for the system as a whole.
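
                    The decomposition can be sketched by writing each sub-system as its own small black box
                    and composing them. The order in which the sub-systems are wired together is an assumption
                    made for illustration, as are the function names, the tuple representation of a shape and
                    the use of the vertex mean as the centre of gravity.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Shape = Tuple[Tuple[Point, ...], str]  # (vertices, colour)


def colour_translator(shape: Shape) -> Shape:
    """Sub-system: blue -> green, green -> red, red -> blue."""
    vertices, colour = shape
    return vertices, {"blue": "green", "green": "red", "red": "blue"}[colour]


def object_cloner(shape: Shape) -> Tuple[Shape, Shape]:
    """Sub-system: duplicate the shape."""
    return shape, shape


def vertical_flipper(shape: Shape) -> Shape:
    """Sub-system: flip around the centre of gravity (vertex mean assumed)."""
    vertices, colour = shape
    centre_y = sum(y for _, y in vertices) / len(vertices)
    return tuple((x, 2 * centre_y - y) for x, y in vertices), colour


def shape_merger(first: Shape, second: Shape) -> List[Shape]:
    """Sub-system: combine both shapes into the single system output."""
    return [first, second]


def shape_transformer(shape: Shape) -> List[Shape]:
    """The outer black box, expressed as a composition of inner black boxes."""
    original, copy = object_cloner(colour_translator(shape))
    return shape_merger(original, vertical_flipper(copy))


if __name__ == "__main__":
    print(shape_transformer((((0, 0), (2, 0), (1, 3)), "red")))
```

                    Each inner function can itself be opened and decomposed in the same way if its behaviour or
                    failure modes need investigation.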


          ² See “Black Box” in the Glossary.
          ³ See “Recursion” in the Glossary.





           7         High Availability Calculation
                     Having established the boundaries of various sub-systems using the systems approach
                     outlined in the previous section, it is now desirable to determine the availability properties of
                     the larger system.

Required sub-systems decrease availability

                     The availability calculation for a system relies on a statistical treatment of the likelihood of
                     failures in sub-systems and an assessment of their direct and indirect consequences. Where
                     any single sub-system is required for system operation, the availability of the system cannot
                     be higher than the availability of that sub-system.

Redundant sub-systems increase availability

                     Conversely, where any single sub-system has other redundant systems that allow it to fail
                     without causing the system to fail, the availability of the system is higher than it would be in
                     the case where the sub-system was non-redundant. These observations and the associated
                     availability equations can be seen in the diagram below.

                           “The availability of a system is the product of the availability of every serial
                           sub-system upon which that system depends, multiplied by the availability derived
                           from the product of the unavailability of each member of a group of redundant
                           parallel sub-systems where the system depends on the availability of that group.”
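
                     The two observations can be checked with a few lines of arithmetic. This is a minimal
                     sketch: availabilities are written as fractions between 0 and 1 rather than percentages,
                     the 99% figures are hypothetical, and the helper names serial and parallel are illustrative.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Every sub-system is required: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one sub-system suffices: one minus the product of unavailabilities."""
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    a = 0.99  # hypothetical availability of a single sub-system
    print(f"one sub-system:            {a:.4%}")
    print(f"two required (serial):     {serial(a, a):.4%}")
    print(f"two redundant (parallel):  {parallel(a, a):.4%}")
```

                     With these figures, two required sub-systems give 98.01%, lower than either alone, while two
                     redundant sub-systems give 99.99%. The parallel formula assumes the sub-systems fail
                     independently of one another.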

                           [Figure 4 shows SYSTEM “S” as a chain of sub-systems:
                              Sub-System 1 (Serial)
                              Sub-System 2 (Serial), containing SS2 Components 1, 2, 3 and 4 (Parallel)
                                                     followed by SS2 Component 5 (Serial)
                              Sub-System 3 and Sub-System 4 (Parallel)
                              Sub-System 5 (Serial), containing SS5 Component 1 (Parallel)
                              Sub-System 6, Sub-System 7 and Sub-System 8 (Parallel)]

                           Avail_S = The availability of system “S” as a percentage of a defined time period.
                           Avail_S = Avail_SS1 * Avail_SS2 * Avail_SS(3,4) * Avail_SS5 * Avail_SS(6,7,8)
                           Where:
                              Avail_SS(3,4)   = 1 - (1 - Avail_SS3) * (1 - Avail_SS4)
                              Avail_SS(6,7,8) = 1 - (1 - Avail_SS6) * (1 - Avail_SS7) * (1 - Avail_SS8)
                              Avail_SS2       = Avail_SS2C5 * Avail_SS2C(1,2,3,4)
                              Where:
                                  Avail_SS2C(1,2,3,4) = 1 - (1 - Avail_SS2C1) * (1 - Avail_SS2C2)
                                                          * (1 - Avail_SS2C3) * (1 - Avail_SS2C4)

                                                 Figure 4: System Availability Calculation

                     The above diagram demonstrates the availability calculation for a system by recursively
                     using the calculation formula for each black box sub-system.
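
                     The recursive use of the formula can be sketched by describing the system as a nested tree
                     of serial and parallel groups and evaluating it from the leaves upwards. The tree below
                     mirrors the equations for system “S”. The uniform 99% availability is a hypothetical value
                     chosen only for illustration, availabilities are fractions between 0 and 1, and the names
                     Node, availability and SYSTEM_S are not from this document.

```python
from math import prod
from typing import List, Tuple, Union

# A node is either a leaf availability or a ("serial" | "parallel", children) group.
Node = Union[float, Tuple[str, List["Node"]]]


def availability(node: Node) -> float:
    """Recursively evaluate the availability of a black box sub-system."""
    if isinstance(node, (int, float)):
        return float(node)
    kind, children = node
    values = [availability(child) for child in children]
    if kind == "serial":
        return prod(values)
    if kind == "parallel":
        return 1.0 - prod(1.0 - value for value in values)
    raise ValueError(f"unknown group kind: {kind}")


A = 0.99  # hypothetical availability of every leaf component

SYSTEM_S: Node = ("serial", [
    A,                                            # Avail_SS1
    ("serial", [A, ("parallel", [A, A, A, A])]),  # Avail_SS2 = SS2C5 * SS2C(1,2,3,4)
    ("parallel", [A, A]),                         # Avail_SS(3,4)
    A,                                            # Avail_SS5
    ("parallel", [A, A, A]),                      # Avail_SS(6,7,8)
])

if __name__ == "__main__":
    print(f"Avail_S = {availability(SYSTEM_S):.6f}")
```

                     Because each group is evaluated the same way regardless of depth, opening another black box
                     only means replacing a leaf value with a nested group.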
Either working or failed, and any human intervention means failed.

                     For architectural purposes and in the context of information technology, a system is
                     considered to be in either a failed or working state, with the system in the failed state when
                     non-routine staff intervention is required. Nevertheless, from both an external service
                     availability and management perspective, the staff that intervene to repair the system in the
                     event of sub-system failure could be conceived to be part of the system.

                     In practice, High Availability design consists of determining optimal sub-system boundaries
                     that make both the understanding and implementation of a system as simple as possible
                     without compromising either the requirements or functionality.






            8         Determining Dependencies
                      An IT system comprises a number of sub-systems, most of which are essential to the
                      system function and so should be considered serial dependencies for high availability
                      architecture purposes. Because of the number of unavoidable serial dependencies in an IT
                      system, the availability of each sub-system must be maximised through the addition of
                      redundant components within it. These sub-systems are shown in the diagram below.
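
                      Before turning to the diagram, the effect of stacking serial dependencies can be
                      illustrated numerically. This is a sketch under stated assumptions: the layer count of
                      eight, the 99.9% per-layer availability and the function name stack_availability are all
                      hypothetical, and redundant components are assumed to fail independently.

```python
def stack_availability(layer_availability: float, layers: int, redundancy: int = 1) -> float:
    """Availability of a stack of identical serial layers.

    Each layer consists of `redundancy` parallel components, any one of which
    is sufficient for that layer to work.
    """
    per_layer = 1.0 - (1.0 - layer_availability) ** redundancy
    return per_layer ** layers


if __name__ == "__main__":
    component = 0.999  # hypothetical availability of one component
    layers = 8         # hypothetical number of serial layers
    single = stack_availability(component, layers)
    doubled = stack_availability(component, layers, redundancy=2)
    print(f"single component per layer:  {single:.4%}")
    print(f"redundant pair per layer:    {doubled:.4%}")
```

                      Under these assumptions, eight serial layers of 99.9% components yield only about 99.2%
                      overall, which is why redundancy has to be added within each layer rather than at one
                      point in the stack.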




An IT system can fail at any layer, at any time and for many different reasons.

                      [Diagram of the sub-systems in an IT system. The labels recoverable from the diagram are:
                         Serial layers:        Application, Sub Application, Core Application, Data Storage,
                                               Operating System, Communications, Electrical, Hardware
                         Axis labels:          High Level Function, Increasing Dependency, Increasing Abstraction
                         Business Influence:   Financial, Support, Political, Legal, Testing (each Parallel)
                         Technical Influence:  Monitoring, Security, Disaster Recovery (each Parallel)
                         Other labels:         CAPACITY, OPERATIONAL, MAINTENANCE, PERFORMANCE, PREVENTION,
                                               MANAGEMENT, DOCUMENTATION, REGULATORY, EXPENDITURE]
                                                                                         EXTERNAL




                                                                                                                                                               (Serial)
                                                                                                           FUNCTION
                                                                                          TRAINING




                                                                                                                                                                                                                             Influence
                             CONTRACTUAL




                                                                                                                                                                                                      DETECTION
                                                                                                                                                           Temperature




                                                                                                                                                                                                                                                                 PLANNING

                                                                                                                                                                                                                                                                 PLANNING
                                                                                                                                                               (Serial)

                                                                                                                                                            Mechanical
                                                                                               INTERNAL




                                                 Business                                                                                                      (Serial)
                                                 Influence
                                                                                                                                                             Location
                                                                                                                                                               (Serial)




                                                                                Figure : Sub-systems in an IT System

[Margin note: Well-designed environments need less HA work at lower levels of the stack.]
The most dependent layers of the model drive the requirements for those layers upon which
they depend. For example, if the application instance is able to use an alternate instance of a
core application, the availability of a specific instance of that core application is less critical.
Conversely, when the application instance cannot use an alternate instance of the core
application, that application instance logically cannot have a higher availability than that of
the core application instance upon which it depends, and consequently, that core application
instance is critical to the operation of the application instance.

[Margin note: Data storage is often critical due to the need for a single source of truth.]
In many IT Systems the most critical component is the data storage sub-system, since there
is often a requirement for a single source of authoritative data upon which to operate. This
contrasts with other sub-systems such as location, electrical, communications and hardware,
which can often be made redundant to form highly available sub-systems.
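
The dependency argument above can be illustrated with simple availability arithmetic. The following is a minimal sketch, assuming that failures are independent and using illustrative availability figures that do not come from this document; the function names are hypothetical.

```python
from math import prod


def serial_availability(availabilities):
    """Every component is required, so the availabilities multiply."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Any one instance is sufficient, so the group fails only if all fail."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Illustrative figures only (assumptions, not measurements).
core_app = 0.99
app_instance = 0.999

# The application depends on one specific core application instance.
single = serial_availability([app_instance, core_app])

# The application can use either of two independent core application instances.
redundant = serial_availability(
    [app_instance, parallel_availability([core_app, core_app])]
)

print(f"single core instance: {single:.5f}")
print(f"two core instances:   {redundant:.5f}")
```

With a single core instance the result can never exceed the availability of that core instance, which is the logical limit described above. With an alternate instance available, the specific instance matters far less.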






           9         Architectural Requirements
                             • The solution must be analysed for single points of failure, which must be addressed.
                             • The solution must be analysed for dependencies, which must be addressed.
                             • An estimated availability for the solution must be determined to ensure that this
                               availability figure is consistent with the availability requirements.
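
One way to approach the first two requirements is to record the sub-systems and their dependencies in a small model and walk it. The sketch below is an assumption-laden illustration: the sub-system names, instance counts and the `find_single_points_of_failure` helper are all hypothetical, and a sub-system is treated as a single point of failure when the solution depends on it and it has fewer than two instances.

```python
def find_single_points_of_failure(instances, depends_on, root):
    """Return sub-systems reachable from `root` that have only one instance."""
    seen, stack, spofs = set(), [root], []
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        # A sub-system with no recorded count is assumed to have one instance.
        if instances.get(name, 1) < 2:
            spofs.append(name)
        stack.extend(depends_on.get(name, []))
    return sorted(spofs)


# Hypothetical model of a solution.
instances = {
    "application": 2,
    "core application": 1,
    "data storage": 1,
    "hardware": 2,
    "electrical": 2,
}
depends_on = {
    "application": ["core application", "hardware"],
    "core application": ["data storage", "hardware"],
    "data storage": ["hardware"],
    "hardware": ["electrical"],
}

spofs = find_single_points_of_failure(instances, depends_on, "application")
print(spofs)
```

Each name reported is a candidate for parallelisation or, where that is not possible, for automated restart.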

           10        Logical Requirements
                     As seen in figure 5 on the previous page, the following logical requirements must be met in
                     order to provide a system capable of meeting the business requirements. A system requires
                     documentation, training, availability of configuration and configuration auditability.
                     The system must be sufficiently documented so as to be supportable and maintainable.
                     There must be sufficient training available for support staff to be able to maintain the system
                     in a timely manner.
                     The availability of the system must be measurable for function and responsiveness. This will
                     require the retention of specific metrics.
                     Configuration details of system components must be available in a timely manner so that a
                     failure of a hardware system will not result in the loss of unique configuration information.
                     Configuration details of system components must be auditable so that erroneous
                     administrative configuration changes can be restored in a timely manner.
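
The measurability requirement can be sketched as follows. This is a minimal illustration, assuming that the retained metrics are periodic probe results recording whether the function worked and how long it took; the `Probe` record, the latency threshold and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    ok: bool          # did the function return the correct result?
    latency_s: float  # how long the response took, in seconds


def measured_availability(probes, max_latency_s):
    """Fraction of probes that were both functionally correct and responsive."""
    if not probes:
        raise ValueError("no probes retained for this period")
    good = sum(1 for p in probes if p.ok and p.latency_s <= max_latency_s)
    return good / len(probes)


# Hypothetical retained metrics: one slow response and one functional failure.
probes = [Probe(True, 0.2), Probe(True, 0.3), Probe(True, 2.5), Probe(False, 0.1)]
availability = measured_availability(probes, max_latency_s=1.0)
print(f"{availability:.0%}")
```

Counting a slow but correct response as unavailable reflects the requirement that availability be measured for responsiveness as well as function.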

           11        High Availability Assumptions
                     That a failure of a single system must be mitigated, and that the failure of multiple
                     systems will be considered to be a failure in the larger system.
                     That a system failure during the critical time window will require automated mitigation and
                     that there will be insufficient time for support staff to be notified, respond, analyse and
                     perform reliable mitigation to restore service.
                     That a failure of a system component can occur at any logical level of the IT solution and can
                     include human mistakes.
                     That the system will scale appropriately and that there is sufficient time during any required
                     time window for the systems to perform all required operations (i.e. the system availability is
                     not required to exceed 100%).
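
The second assumption can be checked with a short calculation comparing the downtime permitted by an availability target with the time a manual response would take. All figures below (the window length, the target and the human response times) are hypothetical and chosen only to show the arithmetic.

```python
def downtime_budget_minutes(availability_target, window_minutes):
    """Downtime allowed within the window at the given availability target."""
    return (1.0 - availability_target) * window_minutes


def manual_recovery_fits(budget_minutes, notify, respond, analyse, mitigate):
    """True if the whole manual response (in minutes) fits within the budget."""
    return notify + respond + analyse + mitigate <= budget_minutes


window = 8 * 60  # hypothetical 8-hour critical time window, in minutes
budget = downtime_budget_minutes(0.999, window)
fits = manual_recovery_fits(budget, notify=5, respond=15, analyse=20, mitigate=10)

print(f"downtime budget: {budget:.2f} minutes, manual recovery fits: {fits}")
```

Under these assumed figures the budget is well under one minute, so only automated mitigation could meet it.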

           12        Architectural Decisions
                              By parallelising sub-systems, no single sub-system instance represents a single
                              point of failure and the availability of the system as a whole is increased.
                              Parallelising sub-systems enables the performance of most maintenance activities
                              on individual sub-system instances without the system ceasing to function.
Decision: Implement the parallelisation of sub-systems where possible.

                              By distributing load between parallel sub-systems the throughput of the group of
                              parallel sub-system instances is higher than it would be for a single sub-system
                              instance.
                              Distributing the load between parallel sub-system instances leverages the
                              investment in hardware and software.
Decision: Distribute load between parallel sub-systems where possible.
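
As a minimal sketch of load distribution, the example below assigns requests to parallel instances in turn (round-robin). Round-robin is only one possible policy and is an assumption here, as are the instance names and the `distribute` helper.

```python
from collections import Counter
from itertools import cycle


def distribute(requests, instances):
    """Assign each request to the next instance in turn (round-robin)."""
    instances = list(instances)
    if not instances:
        raise ValueError("at least one instance is required")
    rotation = cycle(instances)
    return [(request, next(rotation)) for request in requests]


# Six hypothetical requests spread over three parallel instances.
assignments = distribute(range(6), ["instance-a", "instance-b", "instance-c"])
load = Counter(instance for _, instance in assignments)
print(load)
```

Each instance receives an equal share of the work, so the group can serve more requests in a given time than any single instance.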

                              By monitoring the responsiveness of parallel sub-systems, traffic can be directed to
                              responsive instances and away from unresponsive or failed ones.
                              When traffic is routed centrally, effective service delivery is maximised by minimising
                              the duration that traffic is routed to unresponsive or failed parallel sub-system
                              instances.
Decision: Monitor the responsiveness of parallel sub-systems where possible.
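
The monitoring decision can be sketched as a pool that probes each parallel instance and routes traffic only to those that responded in time. The `MonitoredPool` class, the probe callable and the instance names are hypothetical; a production load balancer would typically run the checks on a schedule rather than on demand.

```python
import time


class MonitoredPool:
    """Routes requests only to instances that passed the latest health check."""

    def __init__(self, instances, probe, max_latency_s):
        self.instances = list(instances)
        self.probe = probe  # callable(instance); raises an exception on failure
        self.max_latency_s = max_latency_s
        self.healthy = set(self.instances)

    def check_all(self):
        """Probe every instance and record which ones are responsive."""
        for instance in self.instances:
            start = time.monotonic()
            try:
                self.probe(instance)
                responsive = time.monotonic() - start <= self.max_latency_s
            except Exception:
                responsive = False
            if responsive:
                self.healthy.add(instance)
            else:
                self.healthy.discard(instance)

    def route(self, request_id):
        """Pick a responsive instance for the request."""
        candidates = [i for i in self.instances if i in self.healthy]
        if not candidates:
            raise RuntimeError("no responsive instances")
        return candidates[request_id % len(candidates)]


def fake_probe(instance):
    # Simulated failure of one instance, for illustration only.
    if instance == "instance-b":
        raise ConnectionError("simulated failure")


pool = MonitoredPool(["instance-a", "instance-b", "instance-c"], fake_probe, 1.0)
pool.check_all()
targets = [pool.route(n) for n in range(4)]
print(targets)
```

The shorter the interval between checks, the less time traffic spends being routed to a failed instance.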

                              By using a clustered file system for all sub-system configurations, configuration files
                              can be more easily managed. In the event of the failure of a sub-system, the unique
                              sub-system instance's configuration files are less likely to be lost and are more
                              rapidly available when a new sub-system instance is deployed as part of a disaster
                              recovery plan.
                              The clustered configuration file system serves as a highly available single source of
                              critical configuration data which cannot be stored in the database. The clustered
                              configuration file system can also be used to enable rapid and automated recovery
                              for active/standby sub-systems that maintain state information outside of the
                              database.
Decision: Implement a shared file system to all sub-systems for configuration management.

                              By using a clustered file system for all sub-system configurations, configuration files
                              can be more easily managed (see footnote 4). As human configuration mistakes are a common
                              cause of IT system failure, managing configuration files in a simple, logical,
                              centralised, auditable and consistent manner is one way to increase availability by
                              decreasing the chances of mistakes and decreasing the time taken to recover from
                              them.
Decision: Implement a version control system for all sub-system configuration files.
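
The properties that matter here are an audit trail and a quick way to restore a known-good configuration. The sketch below shows those two properties with a small in-memory history; it is purely illustrative, the `ConfigHistory` class and the sample settings are hypothetical, and in practice an existing version control system would be used instead.

```python
import hashlib


class ConfigHistory:
    """Keeps every committed version of one configuration file."""

    def __init__(self):
        self._versions = []  # (digest, author, text) tuples, oldest first

    def commit(self, text, author):
        """Record a new version and return its content digest."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self._versions.append((digest, author, text))
        return digest

    def current(self):
        return self._versions[-1][2]

    def audit_log(self):
        """Who committed which version, in order."""
        return [(digest[:8], author) for digest, author, _ in self._versions]

    def restore(self, digest, author):
        """Re-commit an earlier version, so the restore itself is audited."""
        for known, _, text in self._versions:
            if known == digest:
                return self.commit(text, author)
        raise KeyError(digest)


history = ConfigHistory()
good = history.commit("max_connections = 100\n", author="alice")
history.commit("max_connections = 1\n", author="bob")  # erroneous change
history.restore(good, author="alice")
print(history.current(), history.audit_log())
```

Because the restore is recorded as a new version, the history shows both the mistake and its correction.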

                              As sub-systems may fail for unknown reasons, availability can be maximised by
                              restarting failed sub-system processes on the same or on alternate machines.
                              Cluster management software, such as Veritas Cluster Server, can automate the
                              execution of these pre-planned mitigation decisions.
                              The clustering software will provide a global view of the availability and status of all
                              services running on both primary and disaster recovery sites. An administrator must
                              be able to easily fail-over sub-systems or the entire system from the primary to the
                              disaster recovery site and back again.
                              The use of cluster management software is most critical for non-parallelisable
                              sub-systems upon which the entire system is dependent.
Decision: Implement cluster management software to automatically restart failed sub-systems.
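
The pre-planned mitigation described above (restart on the same machine, otherwise on an alternate one) can be sketched as follows. This is not the interface of Veritas Cluster Server or any other product: the `SubSystem` class, the `mitigate` function and the callables passed to it are hypothetical stand-ins.

```python
class SubSystem:
    def __init__(self, name, machines):
        self.name = name
        self.machines = list(machines)  # preferred machine first
        self.running_on = None


def mitigate(subsystem, is_machine_up, start):
    """Restart `subsystem` on the first machine where the start succeeds."""
    for machine in subsystem.machines:
        if is_machine_up(machine) and start(subsystem.name, machine):
            subsystem.running_on = machine
            return machine
    subsystem.running_on = None  # nothing worked: escalate to disaster recovery
    return None


# Hypothetical scenario: the preferred machine is down, the alternate is up.
up = {"node-1": False, "node-2": True}
db = SubSystem("database", ["node-1", "node-2"])
placed = mitigate(db, is_machine_up=lambda m: up[m], start=lambda name, m: True)
print(placed)
```

Because the order of machines is decided in advance, no human analysis is needed at the moment of failure.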
                              Inter-site data replication is necessary to provide a remote copy of data for disaster
                              recovery and high availability. This can be performed by a hardware solution or at the
                              file system level.
Decision: Implement inter-site data replication.




           4   This is always a cause of contentious ‘camps’ in architectural discussions. One position is that ‘running
               configuration’ instances should use identical configuration files, while ‘individual configuration file’
               proponents maintain that shared configuration files make upgrades more difficult. A possible compromise is
               the use of snapshots and/or altered mount details only during upgrade procedures.




13         Glossary
Term                   Description
Active-Standby,       Hot/Active: Actively processing data. Warm/Standby: Processing capability on standby.
Active-Active,        Active-Standby or Hot-Warm is defined as a model where the production application
Hot-Warm,             instance or facility (Active or Hot) will provide operational services in a business as usual
Hot-Hot               state while a disaster recovery application instance or facility (Standby or Warm) is available
                      to take over service provision in the event of a failure in production.
Black Box             In science and engineering, a black box is a device, system or object which can be viewed
                      solely in terms of its input, output and transfer characteristics without any knowledge of its
                      internal workings, that is, its implementation is "opaque" (black). (Wikipedia)
Database              Database replication can be used on many database management systems, usually with an
Replication           active-standby relationship between the original and the copies. The active logs the updates,
                      which then ripple through to the standby copies.
                      The standby acknowledges that it has received the update successfully, thus allowing the
                      sending (and potentially re-sending until successfully applied) of subsequent updates.
                      Database replication provides a higher level of reporting than log shipping; but does not lock
                      passive databases from user changes and so is unsuitable for failover. (Wikipedia)
Disaster              Disaster Recovery is a system enabling the recovery of services after an interruption due to
Recovery              events not mitigated by a High Availability system, or due to High Availability system failure.
High Availability     High Availability is the automatic continuation or resumption of service after a predictable
                      interruption.
Log Shipping          Log shipping is the process of automating the backup of a database and transaction log files
                      on a primary database server, and then restoring them onto a standby server.
                      Similar to Database Replication, the primary purpose of log shipping is to increase database
                      availability by maintaining a backup server to quickly replace the primary server.
                      Log Shipping locks the standby database from user changes and is often chosen for its low
                      cost in human and server resources and ease of implementation. Failover between primary
                      and standby servers is manual and limited reporting capabilities are possible. (Wikipedia)
Magic                 In the context of programming, Magic is an informal term for the use of code that handles
                      complex tasks while hiding that complexity to present a simple interface. (Wikipedia)
                      In computer system design, Magic is used as an informal term to describe gaps in
                      understanding the process of interaction between one system and another.
Oracle Streams        Oracle Streams is available on Enterprise Edition systems only and enables propagation of
                      information within and between Oracle and other databases. Oracle has announced the
                      deprecation of Streams and now encourages the use of GoldenGate (acquired by Oracle in July 2009).
Recursion             Recursion is the process of repeating items in a self-similar way. For instance, when the
                      surfaces of two mirrors are exactly parallel with each other the nested images that occur are
                      a form of infinite recursion. The term has a variety of meanings specific to a variety of
                      disciplines ranging from linguistics to logic.
                      The most common application of recursion is in mathematics and computer science, in
                      which it refers to a method of defining functions in which the function being defined is
                      applied within its own definition.
                      Specifically this defines an infinite number of instances (function values), using a finite
                      expression that for some instances may refer to other instances, but in such a way that no
                      loop or infinite chain of references can occur. The term is also used more generally to
                      describe a process of repeating objects in a self-similar way. (Wikipedia)
The Resilient         The Resilient Enterprise is a well-known reference book on high availability and disaster
Enterprise            recovery published by Veritas Software (now Symantec) in 2002.
Veritas Cluster       Veritas Cluster Server is high-availability cluster software for Unix, Linux and Microsoft
Server                Windows computer systems, created by Veritas Software (now part of Symantec). It
                      provides application cluster capabilities to systems running databases, file sharing on a
                      network, electronic commerce websites or other applications.
                      Veritas Cluster Server is one of the few products in the industry that provides both high
                      availability and disaster recovery across all major operating systems while supporting 40+
                      major application / replication technologies out of the box.
                      Similar products include Fujitsu PRIMECLUSTER, IBM HACMP, HP Serviceguard, IBM
                      Tivoli System Automation for Multiplatforms, Linux-HA, Microsoft Cluster Server, NEC
                      ExpressCluster, Red Hat Cluster Suite, SteelEye LifeKeeper and Sun Cluster. (Wikipedia)





Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Understanding High Availability - Introducing the Theory and Concepts of High Availability

  • 1. Understanding High Availability
       Introducing the Theory and Concepts of High Availability

Version Number:       1.04
Status:               Final
Author:               Paul Moore, HA Infrastructure Architect
Date Published:       21 March 2012 (V1.0), 2 September 2012
File Name:            Understanding High Availability v1.04.docx
Copyright:            © 2012 Paul Moore, Astute Systems
License:              Creative Commons Attribution 3.0 License
  • 2. Understanding High Availability 2 of 10

Acknowledgements:
Name                                      Contributions
Brenton Carbins, Socius in Veritas        Review, Footnote 4, The Resilient Enterprise
Debbie Moore, The Picky Proofreader       Review, Proofreading
iStockPhoto                               Cover Photo

Reviewers:
Role                          Name                Review Date
Infrastructure Architect      Paul Moore          21-Mar-2012
Infrastructure Architect      Brenton Carbins     16-Mar-2012

Reference Documents:
Title                       Author                                Version           Location
The Resilient Enterprise    Richard Barker, Veritas (Symantec)    Published 2002    Commercially Available

Contents
1       Introduction                                  3
2       Definition                                    3
3       Costs and Benefits of Availability            3
4       Prediction                                    4
5       Sufficient Understanding                      4
6       A Systems Approach                            5
7       High Availability Calculation                 6
8       Determining Dependencies                      7
9       Architectural Requirements                    8
10      Logical Requirements                          8
11      High Availability Assumptions                 8
12      Architectural Decisions                       8
13      Glossary                                      10

Figures
Figure 1: A Conceptual Graph of Availability versus Cost      3
Figure 2: An example of a Black Box system                    5
Figure 3: An example of Black Box system recursion            5
Figure 4: System Availability Calculation                     6
Figure 5: Sub-systems in an IT System                         7

© 2012 Paul Moore, Astute Systems Published under the License
  • 3.

1 Introduction

To develop any high availability infrastructure it is essential first to understand what high availability is and is not. This document attempts to communicate High Availability concepts concisely.

2 Definition

Disaster Recovery and High Availability are related, yet different, concepts. They can be summarised as follows:

    • High Availability is an approach to minimising the probability of a failure to provide an operational service. High Availability is the automatic continuation or resumption of service after a predictable interruption.
      Example: Disk mirroring continues to provide data in the event of a disk failure. (But it does not guarantee that the highly available data is uncorrupted.)
      [Margin note: HA focuses on minimising the chance of a service failure.]

    • Disaster Recovery is an approach to restoring operational service after a failure to provide it due to a predictable or non-predictable event. Disaster Recovery is a system enabling the recovery of services after an interruption due to events not mitigated by a HA system, or due to the failure of a HA system.
      Example: Backups enable recovery from a service failure due to data loss.

3 Costs and Benefits of Availability

As the level of service availability increases, the cost of providing it increases disproportionately, by orders of magnitude, owing to the increasing architectural complexity and resource use, and ends with the impossibility of providing any further availability using currently known technology. As a result, an appropriate balance must be achieved between the costs of implementing availability and the costs of non-availability.
[Margin note: Availability increases, but complexity and cost increase hyper-logarithmically.]
[Figure 1: A Conceptual Graph of Availability versus Cost. Vertical axis: Unit $ Cost, on a logarithmic scale from 0.0001 to 10000. Horizontal axis: Availability, from 90.000% to 99.998%.]
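To make the availability axis of Figure 1 more concrete, each percentage can be converted into the downtime it permits per year. The following is a minimal sketch only; the function name `annual_downtime_hours` and the choice of a 365-day year are assumptions made for illustration and do not come from the original paper.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8,760 hours


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year implied by an availability figure."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # A few of the availability levels shown on the horizontal axis of Figure 1.
    for level in (90.0, 99.0, 99.9, 99.99, 99.998):
        hours = annual_downtime_hours(level)
        print(f"{level:7.3f}% availability allows {hours:9.3f} hours of downtime per year")
```

Each additional "nine" reduces the permitted downtime by a factor of ten, which gives a feel for why the last fractions of a percent are the hardest to obtain.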
  • 4.

4 Prediction

The architecture of a high availability service requires an assessment and prediction of the most likely and frequent causes of potential service interruption, and a resultant design that enables the service to continue operating when the predicted event occurs.
[Margin note: A prediction of the statistical likelihood of future events …]

These assessments and predictions will invariably differ from the actual occurrence of events observed during future service operation. As a result, the actual performance can never be guaranteed through the use of any particular architecture, design or implementation. The actual future availability of the service will, by definition, be a historic statistical measurement over a set period of time.
[Margin note: … but availability is a historic measure …]

A high availability architecture seeks to provide a higher functional service level by designing systems capable of withstanding a range of conceivable failure scenarios; however, a perfect service will never be possible, owing to limitations imposed by hardware, software, communications, policies, cost and the inherently limited ability to predict the likelihood of future events and their consequences.
[Margin note: … so there are no guarantees.]

5 Sufficient Understanding

To design a highly available system, a thorough understanding of its components is required, to the degree that all significant availability risks to the system are understood and managed.
[Margin note: The devil is always in the detail …]

British writer and scientist Arthur C. Clarke stated in his third law of prediction: “Any sufficiently advanced technology is indistinguishable from magic.” [1]

Adopting the above terminology, all magic must be eliminated from the system through enquiry and investigation.
[Margin note: … so eliminate all “magic”.]

Several tools can assist in gaining this understanding. Where a system contains complexity and where there is a logical layering of component sub-systems, a systems approach is one of the most useful. This approach is outlined in the following section.

[1] See “Magic” in the Glossary.
  • 5.

6 A Systems Approach

In determining a system's level of availability it can be useful to apply a black-box [2] approach. This maximises flexibility by enabling arbitrary boundaries to be drawn to best suit any particular scenario, enforces a rigorous and disciplined focus on the functional requirements of the system, and eliminates consideration of unnecessary details which might otherwise complicate the assessment.
[Margin note: A ‘black box’ model.]

This systems approach, and the types of information necessary to use it, can best be demonstrated using a simple example. The example system takes a two-dimensional shape of a particular colour as an input; changes any blue to green, any green to red and any red to blue; duplicates the shape; vertically flips one of the shapes around its centre of gravity; and sends the result to the output.

[Figure 2: An example of a Black Box system. A black box labelled "2D Shape Transformer System" sits between an input and an output, each with the properties 2D Shape and Colour. Function: RGB colour B→G, G→R, R→B; duplicate shape; vertical flip one shape.]

How the system implements its internal functions is unknown and need not be known, because all behaviour is fully defined. Consequently the black box can be used without internal investigation, which eases analysis.

Investigation of the internal working of the system is required in a number of circumstances, including:
    • when the system input, output or function is not fully known,
    • when the system behaviour must be validated,
    • when the system must be assessed for potential failure vulnerabilities,
… with the last being most important when determining or validating system availability.
[Margin note: When must the ‘black box’ be opened?]

An investigation can be performed by breaking the original black box system into its various functional components, with each of these in turn being considered as an individual black box sub-system, as shown in the diagram below.
[Figure 3: An example of Black Box system recursion. The 2D Shape Transformer System is opened to show four black box sub-systems, each with its own input and output: Colour Translator System, Object Cloning System, Vertical Flipping System and Shape Merge System. Input and output properties: 2D Shape, Colour. Function: RGB colour B→G, G→R, R→B; duplicate shape; vertical flip one shape.]

In the event that an investigation of one or more of the individual sub-systems is required, an additional level of recursion [3] can be performed on each of them by using the same criteria and method as used for the system as a whole.

[2] See “Black Box” in the Glossary.
[3] See “Recursion” in the Glossary.
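The decomposition in Figure 3 can be sketched in code, with each sub-system written as a function whose callers rely only on its input and output. This is an illustrative sketch, not the author's implementation: the `Shape` type, the function names, the order in which the sub-systems are wired together, and the use of the mean of the vertices as the "centre of gravity" are all assumptions.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Shape:
    """Hypothetical input/output type carrying the two properties: 2D shape and colour."""
    points: Tuple[Point, ...]
    colour: str  # "red", "green" or "blue"


# Colour rule from the example: blue -> green, green -> red, red -> blue.
COLOUR_MAP = {"blue": "green", "green": "red", "red": "blue"}


def colour_translator(shape: Shape) -> Shape:
    """Black box sub-system: translate the colour, leave the geometry alone."""
    return replace(shape, colour=COLOUR_MAP[shape.colour])


def object_cloner(shape: Shape) -> Tuple[Shape, Shape]:
    """Black box sub-system: one shape in, two identical shapes out."""
    return shape, shape  # Shape is immutable, so sharing the object is safe


def vertical_flipper(shape: Shape) -> Shape:
    """Black box sub-system: mirror the shape about the horizontal line through its centre.

    Assumption: the centre is taken as the mean of the vertices.
    """
    centre_y = sum(y for _, y in shape.points) / len(shape.points)
    return replace(shape, points=tuple((x, 2 * centre_y - y) for x, y in shape.points))


def shape_merger(first: Shape, second: Shape) -> List[Shape]:
    """Black box sub-system: combine both shapes into the single output."""
    return [first, second]


def shape_transformer(shape: Shape) -> List[Shape]:
    """The outer black box, built only from the sub-systems' inputs and outputs."""
    recoloured = colour_translator(shape)
    original, copy = object_cloner(recoloured)
    return shape_merger(original, vertical_flipper(copy))
```

A caller of `shape_transformer` needs to know nothing about the four inner functions, and any one of them could be opened up and decomposed again in the same way.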
  • 6.

7 High Availability Calculation

Having established the boundaries of the various sub-systems using the systems approach outlined in the previous section, it is now desirable to determine the availability properties of the larger system.

The availability calculation for a system relies on a statistical treatment of the likelihood of failures in sub-systems and an assessment of their direct and indirect consequences. Where any single sub-system is required for system operation, the availability of the system cannot be higher than the availability of that sub-system.
[Margin note: Required sub-systems decrease availability.]

Conversely, where any single sub-system has other redundant systems that allow it to fail without causing the system to fail, the availability of the system is higher than it would be if the sub-system were non-redundant. These observations and the associated availability equations can be seen in the diagram below.
[Margin note: Redundant sub-systems increase availability.]

“The availability of a system is the product of the availability of every serial sub-system upon which that system depends, multiplied by the availability derived from the product of the unavailability of each member of a group of redundant parallel sub-systems where the system depends on the availability of that group.”

[Figure 4: System Availability Calculation. The diagram shows a system made up of Sub-Systems 1, 2 and 5 as serial dependencies, Sub-Systems 3 and 4 as one group of redundant parallel sub-systems, and Sub-Systems 6, 7 and 8 as another. Sub-System 2 is itself broken down into SS2 Components 1 to 4 in parallel and SS2 Component 5 in series; Sub-System 5 contains SS5 Component 1.]

Avail_S = the availability of system “S” as a percentage of a defined time period. In the equations, availabilities are expressed as fractions between 0 and 1.

Avail_S = Avail_SS1 * Avail_SS2 * Avail_SS(3,4) * Avail_SS5 * Avail_SS(6,7,8)

Where:
    Avail_SS(3,4)   = 1 - (1 - Avail_SS3) * (1 - Avail_SS4)
    Avail_SS(6,7,8) = 1 - (1 - Avail_SS6) * (1 - Avail_SS7) * (1 - Avail_SS8)
    Avail_SS2       = Avail_SS2C5 * Avail_SS2C(1,2,3,4)

Where:
    Avail_SS2C(1,2,3,4) = 1 - (1 - Avail_SS2C1) * (1 - Avail_SS2C2) * (1 - Avail_SS2C3) * (1 - Avail_SS2C4)

The above diagram demonstrates the availability calculation for a system by recursively applying the calculation formula to each black box sub-system.

For architectural purposes, and in the context of information technology, a system is considered to be in either a failed or a working state, with the system in the failed state when non-routine staff intervention is required. Nevertheless, from both an external service availability and a management perspective, the staff who intervene to repair the system in the event of sub-system failure could be conceived to be part of the system.
[Margin note: Either working or failed, and any human intervention means failed.]

In practice, High Availability design consists of determining optimal sub-system boundaries that make both the understanding and the implementation of a system as simple as possible without compromising either the requirements or the functionality.
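The equations of Figure 4 can be worked through numerically. The sketch below is one possible rendering under stated assumptions: the helper names `serial`, `parallel` and `figure4_system` are invented here, and the value of 0.99 given to every leaf sub-system is hypothetical and chosen only to show the arithmetic. Multiplying availabilities in this way also implicitly treats sub-system failures as independent of one another.

```python
from math import prod  # available in Python 3.8 and later
from typing import Dict


def serial(*availabilities: float) -> float:
    """Availability of sub-systems that are all required: the product of each."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Availability of a redundant group: 1 minus the product of each unavailability."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def figure4_system(avail: Dict[str, float]) -> float:
    """Apply the Figure 4 equations recursively, starting inside Sub-System 2."""
    ss2 = serial(
        avail["SS2C5"],
        parallel(avail["SS2C1"], avail["SS2C2"], avail["SS2C3"], avail["SS2C4"]),
    )
    return serial(
        avail["SS1"],
        ss2,
        parallel(avail["SS3"], avail["SS4"]),
        avail["SS5"],
        parallel(avail["SS6"], avail["SS7"], avail["SS8"]),
    )


if __name__ == "__main__":
    names = ["SS1", "SS3", "SS4", "SS5", "SS6", "SS7", "SS8",
             "SS2C1", "SS2C2", "SS2C3", "SS2C4", "SS2C5"]
    hypothetical = {name: 0.99 for name in names}  # illustrative values only
    print(f"Avail_S = {figure4_system(hypothetical):.6f}")
```

With these hypothetical inputs the result is dominated by the three non-redundant elements (Sub-System 1, Sub-System 5 and SS2 Component 5), since each redundant group contributes a factor very close to 1.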
  • 7.

8 Determining Dependencies

An IT system comprises a number of sub-systems, most of which are essential to the system function and so should be considered as serial dependencies for high availability architecture purposes. Because of the number of unavoidable serial dependencies in the IT system, the availability of each sub-system must be maximised through the addition of redundant components within each sub-system. These sub-systems are shown in the diagram below.

[Figure 5: Sub-systems in an IT System. The figure stacks the sub-system layers Application, Sub Application, Core Application, Data Storage, Operating System, Communications, Electrical, Hardware, Temperature, Mechanical and Location as serial dependencies. Supporting functions (Disaster Recovery, Monitoring, Security, Support and Testing) are marked as parallel, and the figure also shows Financial, Technical, Business, Political and Legal influences. Axis labels: High Level Function, Increasing Dependency, Increasing Abstraction. Other labels: CAPACITY, OPERATIONAL, DOCUMENTATION, MANAGEMENT, MAINTENANCE, PERFORMANCE, PREVENTION, DETECTION, PLANNING, TRAINING, FUNCTION, EXPENDITURE, REGULATORY, CONTRACTUAL, INTERNAL, EXTERNAL.]
[Margin note: An IT system can fail at any layer, at any time and for many different reasons.]

The most dependent layers of the model drive the requirements for those layers upon which they depend. For example, if the application instance is able to use an alternate instance of a core application, the availability of a specific instance of that core application is less critical.
[Margin note: Well designed environments need less HA work at lower levels of the stack.]

Conversely, when the application instance cannot use an alternate instance of the core application, that application instance logically cannot have a higher availability than that of the core application instance upon which it depends. Consequently, that core application instance is critical to the operation of the application instance.

In many IT systems the most critical component is the data storage sub-system, since there is often a requirement for a single source of authoritative data upon which to operate. This contrasts with other sub-systems, such as location, electrical, communications and hardware, which can often be made redundant to form highly available sub-systems.
[Margin note: Data storage is often critical due to the need for a single source of truth.]
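The effect of stacking serial dependencies can be shown with a short worked example. This is an illustrative sketch rather than a sizing method: the function name `stack_availability`, the figure of 99.9% per layer, and the simplification that every layer has the same availability and the same number of redundant copies are all assumptions. The count of 11 layers matches the number of layers named in Figure 5.

```python
def stack_availability(layer_availability: float, layers: int, copies: int = 1) -> float:
    """Availability of a stack of identical serial layers.

    Each layer is a group of `copies` redundant instances, so it fails only
    when every copy has failed; the stack works only when every layer works.
    """
    if layers < 1 or copies < 1:
        raise ValueError("layers and copies must both be at least 1")
    redundant_layer = 1.0 - (1.0 - layer_availability) ** copies
    return redundant_layer ** layers


if __name__ == "__main__":
    single = stack_availability(0.999, layers=11)             # no redundancy
    doubled = stack_availability(0.999, layers=11, copies=2)  # every layer duplicated
    print(f"11 serial layers, one copy each:   {single:.6f}")
    print(f"11 serial layers, two copies each: {doubled:.6f}")
```

Under these assumptions a stack of individually respectable layers is noticeably less available than any single layer, while adding redundancy inside each layer recovers most of the loss.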
  • 8.

9 Architectural Requirements

    • The solution must be analysed for single points of failure, which must be addressed.
    • The solution must be analysed for dependencies, which must be addressed.
    • An estimated availability for the solution must be determined to ensure that this availability figure is consistent with the availability requirements.

10 Logical Requirements

As seen in Figure 5 on the previous page, the following logical requirements must be met in order to provide a system capable of meeting the business requirements.
[Margin note: A system requires… documentation, training, availability of configuration, configuration auditability.]

The system must be sufficiently documented so as to be supportable and maintainable.

There must be sufficient training available for support staff to be able to maintain the system in a timely manner.

The availability of the system must be measurable for function and responsiveness. This will require the retention of specific metrics.

Configuration details of system components must be available in a timely manner so that a failure of a hardware system will not result in the loss of unique configuration information.

Configuration details of system components must be auditable so that erroneous administrative configuration changes can be restored in a timely manner.

11 High Availability Assumptions

That a failure of a single system must be mitigated against, and that the failure of multiple systems will be considered to be a failure in the larger system.

That a system failure during the critical time window will require automated mitigation and that there will be insufficient time for support staff to be notified, respond, analyse and perform reliable mitigation to restore service.

That a failure of a system component can occur at any logical level of the IT solution and can include human mistakes.
That the system will scale appropriately and that there is sufficient time during any required time window for the systems to perform all required operations. (That is, the system availability is not required to exceed 100%.)

12 Architectural Decisions

By parallelising sub-systems, no single sub-system instance represents a single point of failure and the availability of the system as a whole is increased. Parallelising sub-systems enables the performance of most maintenance activities on individual sub-system instances without the system ceasing to function.
Decision: Implement the parallelisation of sub-systems where possible.

By distributing load between parallel sub-systems, the throughput of the group of parallel sub-system instances is higher than it would be for a single sub-system instance. Distributing the load between parallel sub-system instances leverages the investment in hardware and software.
Decision: Distribute load between parallel sub-systems where possible.

By monitoring the responsiveness of parallel sub-systems, traffic can be directed to responsive instances and away from unresponsive or failed ones. When traffic is routed centrally, effective service delivery is maximised by minimising the duration that traffic is routed to unresponsive or failed parallel sub-system instances.
Decision: Monitor the responsiveness of parallel sub-systems where possible.

  • 9.

By using a clustered file system for all sub-system configurations, configuration files can be more easily managed. In the event of the failure of a sub-system, the unique sub-system instance's configuration files are less likely to be lost and are more rapidly available when a new sub-system instance is deployed as part of a disaster recovery plan. The clustered configuration file system serves as a highly available single source of critical configuration data which cannot be stored in the database. The clustered configuration file system can also be used to enable rapid and automated recovery for active/standby sub-systems that maintain state information outside of the database.
Decision: Implement a shared file system across all sub-systems for configuration management.

By using a clustered file system for all sub-system configurations, configuration files can be more easily managed [4]. As human configuration mistakes are a common cause of IT system failure, managing configuration files in a simple, logical, centralised, auditable and consistent manner is one way to increase availability, by decreasing the chances of mistakes and decreasing the time taken to recover from them.
Decision: Implement a version control system for all sub-system configuration files.

As sub-systems may fail for unknown reasons, availability can be maximised by restarting failed sub-system processes on the same or on alternate machines. Cluster management software, such as Veritas Cluster Server, can automate the execution of these pre-planned mitigation decisions. The clustering software will provide a global view of the availability and status of all services running on both primary and disaster recovery sites. An administrator must be able to easily fail over sub-systems, or the entire system, from the primary to the disaster recovery site and back again. The use of cluster management software is most critical for non-parallelisable sub-systems upon which the entire system is dependent.
Decision: Implement cluster management software to automatically restart failed sub-systems.

Inter-site data replication is necessary to provide a remote copy of data for disaster recovery and high availability. This can be performed by a hardware solution or at the file system level.
Decision: Implement inter-site data replication.

[4] This is always a cause of contentious ‘camps’ in architectural discussions. One position is that ‘running configuration’ instances should use identical configuration files, while ‘individual configuration file’ proponents maintain that shared configuration files make upgrades more difficult. A possible compromise is the use of snapshots and/or altered mount details only during upgrade procedures.
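The load-distribution and monitoring decisions in section 12 can be sketched together as a small router that spreads requests across parallel instances and skips any that failed their last responsiveness check. This is a minimal illustration under stated assumptions: the class name `ParallelGroup`, the round-robin policy and the `probe` callback are all hypothetical, and a real deployment would normally rely on a load balancer or cluster manager rather than code like this.

```python
from typing import Callable, List, Optional, Set


class ParallelGroup:
    """Hypothetical router for a group of redundant sub-system instances."""

    def __init__(self, instances: List[str], probe: Callable[[str], bool]) -> None:
        if not instances:
            raise ValueError("a parallel group needs at least one instance")
        self._instances = list(instances)
        self._probe = probe  # returns True when the named instance is responsive
        self._healthy: Set[str] = set(self._instances)
        self._next = 0  # round-robin position

    def refresh(self) -> None:
        """Re-check every instance; call this periodically from a monitor."""
        self._healthy = {name for name in self._instances if self._probe(name)}

    def route(self) -> Optional[str]:
        """Return the next responsive instance in rotation, or None if all have failed."""
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % len(self._instances)
            if candidate in self._healthy:
                return candidate
        return None  # every parallel instance is down: the group itself has failed


if __name__ == "__main__":
    # Illustrative probe: pretend instance "app-2" has stopped responding.
    group = ParallelGroup(["app-1", "app-2", "app-3"], probe=lambda name: name != "app-2")
    group.refresh()
    print([group.route() for _ in range(4)])
```

The shorter the interval between calls to `refresh`, the less time traffic spends being routed to an instance that has already failed, which is the point made in the monitoring decision above.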
13 Glossary

Active-Standby, Active-Active, Hot-Warm, Hot-Hot
Hot/Active: Actively processing data. Warm/Standby: Processing capability on standby. Active-Standby or Hot-Warm is defined as a model where the production application instance or facility (Active or Hot) will provide operational services in a business as usual state while a disaster recovery application instance or facility (Standby or Warm) is available to take over service provision in the event of a failure in production.

Black Box
In science and engineering, a black box is a device, system or object which can be viewed solely in terms of its input, output and transfer characteristics without any knowledge of its internal workings, that is, its implementation is "opaque" (black). (Wikipedia)

Database Replication
Database replication can be used on many database management systems, usually with an active-standby relationship between the original and the copies. The active logs the updates, which then ripple through to the standby copies. The standby acknowledges that it has received the update successfully, thus allowing the sending (and potentially re-sending until successfully applied) of subsequent updates. Database replication provides a higher level of reporting than log shipping, but does not lock passive databases from user changes and so is unsuitable for failover. (Wikipedia)

Disaster Recovery
Disaster Recovery is a system enabling the recovery of services after an interruption due to events not mitigated by a High Availability system, or due to High Availability system failure.

High Availability
High Availability is the automatic continuation or resumption of service after a predictable interruption.

Log Shipping
Log shipping is the process of automating the backup of a database and transaction log files on a primary database server, and then restoring them onto a standby server. Similar to Database Replication, the primary purpose of log shipping is to increase database availability by maintaining a backup server to quickly replace the primary server. Log Shipping locks the standby database from user changes and is often chosen for its low cost in human and server resources and ease of implementation. Failover between primary and standby servers is manual, and limited reporting capabilities are possible. (Wikipedia)

Magic
In the context of programming, Magic is an informal term for the use of code that handles complex tasks while hiding that complexity to present a simple interface. (Wikipedia) In computer system design, Magic is used as an informal term to describe gaps in understanding the process of interaction between one system and another.

Oracle Streams
Oracle Streams is available on Enterprise Edition systems only and enables propagation of information within and between Oracle and other databases. Oracle has announced the deprecation of Streams and now encourages usage of GoldenGate (acquired by Oracle in July 2009).

Recursion
Recursion is the process of repeating items in a self-similar way. For instance, when the surfaces of two mirrors are exactly parallel with each other, the nested images that occur are a form of infinite recursion. The term has a variety of meanings specific to a variety of disciplines ranging from linguistics to logic. The most common application of recursion is in mathematics and computer science, in which it refers to a method of defining functions in which the function being defined is applied within its own definition. Specifically this defines an infinite number of instances (function values), using a finite expression that for some instances may refer to other instances, but in such a way that no loop or infinite chain of references can occur. The term is also used more generally to describe a process of repeating objects in a self-similar way. (Wikipedia)

The Resilient Enterprise
The Resilient Enterprise is a well-known reference book on high availability and disaster recovery published by Veritas Software (now Symantec) in 2002.

Veritas Cluster Server
Veritas Cluster Server is high-availability cluster software for Unix, Linux and Microsoft Windows computer systems, created by Veritas Software (now part of Symantec). It provides application cluster capabilities to systems running databases, file sharing on a network, electronic commerce websites or other applications. Veritas Cluster Server is one of the few products in the industry that provides both high availability and disaster recovery across all major operating systems while supporting 40+ major application / replication technologies out of the box. Similar products include Fujitsu PRIMECLUSTER, IBM HACMP, HP Serviceguard, IBM Tivoli System Automation for Multiplatforms, Linux-HA, Microsoft Cluster Server, NEC ExpressCluster, Red Hat Cluster Suite, SteelEye LifeKeeper and Sun Cluster. (Wikipedia)
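
The acknowledge-and-resend behaviour described in the Database Replication glossary entry can be illustrated with a small sketch. This is not how any particular database product implements replication; the class names (Active, Standby), the fail_next fault counter and the max_attempts limit are all hypothetical and exist only to show that the next update is sent only after the previous one has been acknowledged.

```python
from dataclasses import dataclass, field


@dataclass
class Standby:
    """Hypothetical standby copy that applies updates strictly in order."""
    applied: list = field(default_factory=list)
    fail_next: int = 0  # assumption: number of upcoming deliveries that fail

    def receive(self, seq: int, update: str) -> bool:
        """Apply the update and return True as the acknowledgement."""
        if self.fail_next > 0:
            self.fail_next -= 1
            return False  # no acknowledgement, so the active must resend
        if seq != len(self.applied):
            return False  # refuse anything that is not the next update
        self.applied.append(update)
        return True


@dataclass
class Active:
    """Hypothetical active instance that logs updates before shipping them."""
    log: list = field(default_factory=list)
    acked: int = 0  # count of log entries the standby has acknowledged

    def write(self, update: str) -> None:
        self.log.append(update)

    def replicate(self, standby: Standby, max_attempts: int = 3) -> int:
        """Ship unacknowledged log entries in order; return the number of sends."""
        sends = 0
        while self.acked < len(self.log):
            for _ in range(max_attempts):
                sends += 1
                if standby.receive(self.acked, self.log[self.acked]):
                    self.acked += 1  # acknowledged: the next update may be sent
                    break
            else:
                raise RuntimeError(f"update {self.acked} was never acknowledged")
        return sends
```

In this sketch, an Active holding three logged updates and a Standby created with fail_next=1 needs four sends: the first delivery is lost, the resend succeeds, and the remaining two updates follow in order.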