An MC/ServiceGuard cluster is a networked grouping of HP 9000 series 800 servers (nodes) having sufficient redundancy of software and hardware that a single point of failure will not significantly disrupt service. MC/ServiceGuard software provides only part of a high availability solution that includes disk mirroring, redundant disk interface links and uninterruptible power supplies (UPS). Applications and services are grouped together in packages. In case of a service, node or network failure, MC/ServiceGuard can automatically transfer control of all system resources in a designated package to another node within a cluster, allowing your applications to remain available with minimal interruption.
MC/ServiceGuard provides the following features:
In the case of LAN failure, MC/ServiceGuard switches to a standby LAN on the same node.
In the case of SPU (System Processing Unit) failure, the package is transferred from the failed SPU to a functioning SPU automatically and in a minimal amount of time.
For software failures, an application can be restarted on the same node or another node with minimum disruption as per predefined rules.
MC/ServiceGuard also gives you the advantage of easily transferring control of your application to another SPU in order to bring the original SPU down for system administration, maintenance or version upgrades.
Cluster : A cluster is a collection of up to 16 HP-UX servers (8 for MC/ServiceGuard OPS Edition) connected together in order to provide failover functionality to the application(s) that execute on those servers.
Node : A node is a server that is a member of an MC/ServiceGuard cluster. Specifically, the term “node” refers to the server’s role as a member of a cluster. So, when a node is down, it means that the cluster daemons on that server do not respond. When a node is up, it is an active member of the cluster.
Failover : A failover is an event that occurs whenever a clustered node or a package fails. At such time, the package is transferred from its primary node to its adoptive node for execution.
Primary Node : For each package defined within a cluster, the primary node is the server that is the first choice for execution of that package.
Adoptive or Secondary Node : An adoptive node is a server that is defined to take over the execution of a package in the event of a failover.
Local LAN Failover : A local LAN failover occurs when there is a communications failure on an initialized LAN on a node in a cluster and traffic is automatically transferred to an uninitialized (standby) LAN of the same type. The standby LAN must be connected to the same network as the failed LAN. There is no interruption in service to the MC/ServiceGuard package.
Heartbeat : The signal that is exchanged between clustered nodes is called the heartbeat. Heartbeat failure will trigger a ServiceGuard failover.
Package : A package is a grouping of software (application), disks (volume groups), network addresses and monitoring services that execute on a server. When the package fails, the application, disks, network addresses and monitoring services transfer to an adoptive node for execution. If the whole package cannot be transferred to the adoptive node, then it will remain in a halted state.
Cluster Aware or Shared Volume Group : (This does not apply to the PowerPlant cluster since we are using Veritas Volume Manager.) Any volume group that has been made cluster aware is said to have been “clusterized”. Specifically, this means that the vgchange -c y <volume-group-name> command has been executed. This command can be executed only when the cluster is up and the MC/ServiceGuard daemon, cmcld, is running.
Shared Logical Volume : A shared logical volume is one defined in a cluster aware volume group.
Stationary IP Address and Name : The IP address that is configured on a LAN interface and associated with a node in a cluster is referred to as a stationary IP address. The hostname of that node is usually assigned to one of those IP addresses.
Floating or Relocatable IP Address and Name : More than one IP address can be assigned to a single LAN interface. The additional IP address configured on a LAN interface and associated with a package is called a floating or relocatable IP address. This IP address must be in the same subnet as the stationary IP addresses within the cluster. When a package failover occurs, the floating IP address is de-configured from the LAN interface on the primary node and configured on the appropriate LAN interface on the adoptive node.
NODE_TIMEOUT – The Node Timeout parameter controls Cluster Timing. The default is 2 seconds. This setting yields the fastest cluster reformation. However, the use of the default value increases the potential for spurious reformations due to momentary system hangs or network load spikes.
Cabinet – A cabinet may contain several components, such as one or two PCI (Peripheral Component Interconnect) boxes for I/O, as well as disk storage. I/O expansion cabinets can bolt to the main cabinet, and each can comprise five (5) PCI boxes and up to six (6) Digital Power Supplies (DPS). The HP Superdome cabinet can hold up to 28 GB of memory per cabinet, with peak memory controller bandwidth of 16 GB/s and 64 GB/s per 64-way cabinet, as well as 16 I/O channels per cabinet at 200 MB/s (33 MHz PCI) or 400 MB/s (66 MHz PCI).
ServiceGuard Cluster Configuration for Partitioned Systems
The MC/ServiceGuard product provides an infrastructure for the design and implementation of highly available HP-UX clusters that can quickly restore mission critical application services after hardware or software failures. To achieve the highest level of availability, clusters must be configured to eliminate all single points of failure (SPOF). This requires a careful analysis of the hardware and software infrastructure used to build the cluster. Partitioning technologies such as Superdome nPartitions and the HP-UX Virtual Partitions (VPARS) present some unique issues that must be considered when utilizing them within a ServiceGuard configuration.
Partitioning technologies such as nPartitions and VPARS provide increased flexibility in effectively managing system resources. They can be used to provide hardware and/or software fault isolation between applications sharing the same hardware platform. These technologies also allow hardware resources to be more efficiently utilized, based on application capacity requirements, and they provide the means to quickly re-deploy the hardware resources should the application requirements change. Given this capability, it is natural to want to utilize these technologies when designing MC/ServiceGuard clusters. Care must be taken, however, as the use of partitioning does present some unique failure scenarios that must be considered when designing a cluster to meet specific uptime requirements.
The partitioning provided by nPartitions is done at a hardware level and each partition is isolated from both hardware and software failures of other partitions. VPARS partitioning is implemented at a software level. While this provides greater flexibility in dividing hardware resources between partitions and allows partitioning on legacy systems, it does not provide any isolation of hardware failures between the partitions.
Hardware Redundancy – ServiceGuard, like all other HA clustering products, uses hardware redundancy to maintain application availability. For example, the ServiceGuard configuration guidelines require redundant networking paths between the nodes in the cluster. This requirement protects against total loss of communication to a node if a networking interface card fails. If a card should fail, there is a redundant card that can take over for it.
As can be readily seen, this strategy of hardware redundancy relies on an important underlying assumption: the failure of one component is independent of the failure of other components. That is, if the two networking cards were somehow related, then there could exist a failure event that would disable them both simultaneously. This represents a SPOF and effectively defeats the purpose of having redundant cards. It is for this reason that the ServiceGuard configuration rules do not allow both heartbeat networks on a node to travel through multiple ports on the same multi-ported networking interface. A single networking interface card failure would disable both heartbeat networks.
Cluster Membership Protocol – This same philosophy of hardware redundancy is reflected in the clustering concept. If a node in the cluster fails, another node is available to take over applications that were active on the failed node. Determining, with certainty, which nodes in the cluster are currently operational is accomplished through a cluster membership protocol whereby the nodes exchange heartbeat messages and maintain a cluster quorum.
After a failure that results in loss of communication between the nodes, active cluster nodes execute a cluster reformation algorithm that is used to determine the new cluster quorum. This new quorum, in conjunction with the previous quorum, is used to determine which nodes remain in the new active cluster.
The algorithm for cluster reformation generally requires a cluster quorum of a strict majority, that is, more than 50% of the nodes that were previously running. However, exactly 50% of the previously running nodes are allowed to re-form as a new cluster, provided there is a guarantee that the other 50% of the previously running nodes do not also re-form. In these cases, some form of quorum arbitration or tie-breaker is needed. For example, if there is a communication failure between the nodes in a two-node cluster, and each node is attempting to re-form the cluster, then ServiceGuard must only allow one node to form the new cluster. This is accomplished by configuring a cluster lock.
The important concept to note here is that if more than 50% of the nodes in the cluster fail at the same time, then the remaining nodes have insufficient quorum to form a new cluster and fail themselves. This is irrespective of whether or not a cluster lock has been configured. It is for this reason that cluster configuration must be carefully analyzed to prevent failure modes that are common amongst the cluster nodes.
Quorum Arbitration – One form of quorum arbitration is a shared disk device configured as a cluster lock.
The cluster lock disk is a disk area located in a volume group that is shared by all nodes in the cluster. The cluster lock disk is used as a tie-breaker only for situations in which a running cluster fails and, as ServiceGuard attempts to form a new cluster, the cluster is split into two sub-clusters of equal size. Each sub-cluster attempts to acquire the cluster lock. The sub-cluster that gets the cluster lock forms the new cluster and the nodes that were unable to get the lock cease activity. This prevents the possibility of split-brain activity, that is, two sub-clusters running at the same time. If the two sub-clusters are of unequal size, the sub-cluster with greater than 50% of the previous quorum forms the new cluster and the cluster lock is not used.
Two-node cluster configurations are therefore required to configure some type of quorum arbitration: by definition, failure of a node or loss of communication in a two-node cluster results in a 50% partition. Due to the assumption that nodes fail independently of each other (the independent failure assumption), the use of quorum arbitration for cluster configurations with three or more nodes is optional, though highly recommended.
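The re-formation rule described above can be sketched as a small shell function. This is only an illustrative model of the decision, not ServiceGuard code: the function name reform_outcome and its three arguments are hypothetical, and the real protocol involves timing and lock acquisition details that are not modeled here.

```shell
#!/bin/sh
# Hypothetical helper (not part of ServiceGuard): models the quorum rule.
# Usage: reform_outcome PREVIOUS_NODES SURVIVING_NODES LOCK_ACQUIRED(yes|no)
reform_outcome() {
    prev=$1          # nodes that were running before the failure
    alive=$2         # nodes in this sub-cluster after the failure
    lock=${3:-no}    # did this sub-cluster acquire the cluster lock?

    if [ $((alive * 2)) -gt "$prev" ]; then
        # Strict majority: re-forms without using the cluster lock.
        echo "reform"
    elif [ $((alive * 2)) -eq "$prev" ] && [ "$lock" = "yes" ]; then
        # Exactly 50%: only the half that holds the lock re-forms.
        echo "reform-with-lock"
    else
        # Less than 50%, or a 50% split without the lock.
        echo "halt"
    fi
}

reform_outcome 4 3 no     # majority survives
reform_outcome 2 1 yes    # two-node split, this node won the lock
reform_outcome 2 1 no     # two-node split, lock not acquired
reform_outcome 4 1 yes    # fewer than 50% survive; the lock does not help
```

The last call illustrates the point made earlier: once more than 50% of the nodes are lost at the same time, the remaining nodes halt whether or not a cluster lock is configured.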
Partition Interactions – We need to examine to what extent the partitioning schemes either meet or violate the independent failure assumption.
The partitioning provided by nPartitions is done at a hardware level and each partition is isolated from both hardware and software failures of other partitions. This provides very good isolation between the OS instances running within the partitions. So in this sense, nPartitions meets the assumption that the failure of one node (partition) will not affect other nodes. However, within the Superdome infrastructure, there does exist a very small possibility of a failure that can affect all partitions within the cabinet. So, to the extent that this infrastructure failure exists, nPartitions violates the independent failure assumption.
The VPARS form of partitioning is implemented at a software level. While this provides greater flexibility in dividing hardware resources between partitions and allows partitioning on legacy systems, it does not provide any isolation of hardware failures between the partitions. Thus, the failure of a hardware component being used by one partition can bring down other partitions within the same hardware platform. From a software perspective, VPARS provides isolation for most software failures, such as kernel panics, between partitions. Due to the lack of hardware isolation, however, there is no guarantee that a failure, such as a misbehaving kernel that erroneously writes to the wrong memory address, will not affect other OS partitions. Based on these observations, one can conclude that VPARS will violate the independent failure assumption to a greater degree than will nPartitions.
In addition to the failure case interactions, VPARS exhibit a behavior that should also be considered when including a VPARS partition as a node in a ServiceGuard cluster. Due to the nature of the hardware/firmware sharing between VPARS, it is possible for one partition to induce latency in other partitions. For example, during boot up, when the booting partition requests the system firmware to initialize the boot disk, it is possible for other partitions running in the same machine to become blocked until the initialization operation completes. During ServiceGuard qualification testing, delays of up to 13 seconds have been observed on systems with a PCI bus and SCSI disks.
Cluster Configuration Considerations – Using the information from the preceding sections, we can now assess the impacts or potential issues that arise from utilizing partitions (either VPARS or nPartitions) as part of a ServiceGuard cluster. From a ServiceGuard perspective, an OS instance running in a partition is not treated any differently than OS instances running on non-partitioned nodes.
Quorum Arbitration Requirements – ServiceGuard configuration rules for non-partitioned systems require the use of a cluster lock only in the two-node cluster case. This requirement is in place to protect against failures that result in a 50% quorum with respect to the membership prior to the failure. Clusters with more than two nodes do not have this as a strict requirement because of the “independent failure” assumption. As can be seen, this assumption is no longer valid when dealing with partitions. Cluster configurations that contain OS instances running within a partition must be analyzed to determine the impact on cluster membership of a complete failure of hardware components that support more than one partition.
Rule 1. Configurations containing the potential for a loss of more than 50% membership resulting from a single failure are not supported.
These include configurations with the majority of nodes as partitions within a single hardware cabinet. This implies that in the two cabinet case, the partitions must be symmetrically divided between the cabinets.
HP Superdome - 16-, 32-, 64-way and the IO expansion cabinet — successfully passed all twelve criteria required by The Uptime Institute for compliance certification. The systems continued to operate without interruption or loss of functionality through all testing manipulations. The systems were monitored at the operating console and showed no errors, hard or soft, during these tests. Certification was earned at the Tier IV level, the most fault tolerant classification.
Exception: Where all cluster nodes are running within partitions in a single cabinet (the so-called cluster in-a-box configuration), the configuration is supported provided users understand and accept the possibility of a complete cluster failure. This configuration is discussed in the section “Cluster in-A-Box”.
Rule 2. Configurations containing the potential for a loss of exactly 50% membership resulting from a single failure require the use of a cluster lock.
Cluster configurations where the nodes are running in partitions that are wholly contained within two hardware cabinets.
Example: to be supported, a four-node cluster consisting of two nPartitions in each of two Superdome cabinets would require the use of a cluster lock.
Cluster configurations where the nodes are running as VPARS partitions that are wholly contained within two nPartitions.
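Rule 1 and Rule 2 can be checked mechanically once the nodes are grouped by the hardware unit whose failure would take them all down together. The sketch below is a hypothetical helper, not a ServiceGuard tool: classify_config takes one argument per shared failure unit (a cabinet, or an nPartition hosting VPARS), giving the number of cluster nodes in that unit, and a standalone server counts as a unit with one node.

```shell
#!/bin/sh
# Hypothetical helper: classifies a configuration against Rule 1 and Rule 2.
# Usage: classify_config NODES_IN_UNIT_1 [NODES_IN_UNIT_2 ...]
classify_config() {
    [ "$#" -ge 1 ] || return 1

    total=0      # all nodes in the cluster
    largest=0    # most nodes lost by any single hardware failure
    for n in "$@"; do
        total=$((total + n))
        if [ "$n" -gt "$largest" ]; then
            largest=$n
        fi
    done

    if [ "$#" -eq 1 ]; then
        # Exception to Rule 1: every node shares one cabinet.
        echo "cluster-in-a-box"
    elif [ $((largest * 2)) -gt "$total" ]; then
        # Rule 1: a single failure can remove more than 50% of the nodes.
        echo "unsupported"
    elif [ $((largest * 2)) -eq "$total" ]; then
        # Rule 2: a single failure can remove exactly 50% of the nodes.
        echo "lock-required"
    else
        echo "lock-optional"
    fi
}

classify_config 2 2      # two nPartitions in each of two cabinets
classify_config 3 1      # majority of nodes in one cabinet
classify_config 1 1 1    # three independent servers
classify_config 4        # all nodes in one cabinet
```

The first call corresponds to the four-node example above and reports that a cluster lock is required. The check covers the quorum rules only; the other ServiceGuard configuration rules still apply.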
Cluster Configuration and Partitions – Given the configuration requirements described in Rule 1 and Rule 2, a few interesting observations can be made of clusters utilizing partitioning:
If it is determined that a cluster lock is needed for a particular configuration, the cluster must be configured such that the cluster lock is isolated from failures affecting the cluster nodes. This means that the lock device must be powered independently of the cluster nodes (such as hardware cabinets containing the partitions that make up the cluster).
Clusters that are wholly contained within two hardware cabinets and that utilize the cluster lock for quorum arbitration are limited to either two or four nodes. This is due to a combination of the existing ServiceGuard rule that limits support of the cluster lock to four nodes and Rule 1.
Cluster configurations can contain a mixture of VPARS, nPartitions, and independent nodes as long as quorum requirements are met.
For a cluster configuration to contain no single points of failure, it must extend beyond a single hardware cabinet and comply with both the quorum rules and the ServiceGuard configuration rules.
Cluster in-A-Box – One unique cluster configuration possibility that is enabled by partitioning is the so-called cluster in-a-box. In this case all the OS instances (nodes) of the cluster are running in partitions within the same hardware cabinet. While this configuration is subject to single points of failure, it may provide adequate availability characteristics for some applications and is thus considered a supported ServiceGuard configuration. Users must carefully assess the potential impact of a complete cluster failure on their availability requirements before choosing to deploy this type of cluster configuration.
A cluster in-a-box configuration consisting exclusively of VPARS is susceptible to a wider variety of failures that could result in a complete cluster failure than is a cluster made up exclusively of nPartitions.
I/O Considerations – ServiceGuard does not treat OS instances running in a partition any differently than those running on an independent node. Thus, partitions do not provide any exemptions from the normal ServiceGuard connectivity rules (such as redundant paths for heartbeat networks and to storage), nor do they impose any new requirements. There are, however, a couple of interesting aspects related to partitioned systems that should be noted:
While not strictly a “partitioning” issue per se, the Superdome platform that supports nPartitions contains its interface cards in an I/O chassis, and there can be more than one I/O chassis per partition. Since the I/O chassis represents a potential unit of failure, redundant I/O paths for nPartitions must be configured in separate I/O chassis. Generally speaking, Superdome provides enough I/O capacity that ServiceGuard’s redundant path requirement should not constrain the use of partitioning within the cluster.
VPARS on the other hand must share essentially one node’s worth of I/O capacity. In this case, the redundant path requirement can be a limiting factor in determining the number of partitions that can be configured on a single hardware platform.
The use of “combination” cards that combine both network and storage can help in some situations. However, redundant paths for a particular device must be split across separate interface cards (for example, using multiple ports on the same network interface card for the heartbeat LANs is not supported).
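The requirement that redundant paths must not share a unit of failure can be expressed as a simple check. The helper below is a hypothetical sketch: paths_independent and the identifiers passed to it are illustrative names, and it assumes the administrator already knows which interface card or I/O chassis carries each path.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if every redundant path uses a
# different failure unit (interface card or I/O chassis).
# Usage: paths_independent UNIT_ID_1 UNIT_ID_2 [...]
paths_independent() {
    # A single path is not redundant at all.
    [ "$#" -ge 2 ] || return 1

    total=$#
    # Count the distinct unit identifiers among the arguments.
    distinct=$(( $(printf '%s\n' "$@" | sort -u | wc -l) ))

    [ "$distinct" -eq "$total" ]
}

# Two heartbeat LANs on separate cards.
if paths_independent card0 card1; then
    echo "independent"
else
    echo "shared failure unit"
fi

# Two heartbeat LANs on ports of the same multi-port card.
if paths_independent card0 card0; then
    echo "independent"
else
    echo "shared failure unit"
fi
```

For nPartitions the same check would be applied with I/O chassis identifiers rather than card identifiers.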
Latency Considerations – As mentioned previously, there is a latency issue, unique to VPARS, that must be considered when configuring a ServiceGuard cluster to utilize VPARS.
There are certain operations performed by one partition (such as initializing the boot disk during boot up) that can induce delays in other partitions on the same hardware platform. The net result to ServiceGuard is the loss of cluster heartbeats if the delay exceeds the configured NODE_TIMEOUT parameter. If this should happen, the cluster starts the cluster re-formation protocol and, providing the delay is within the failover time, the delayed node simply rejoins the cluster. This results in cluster re-formation messages appearing in the syslog(1m) file with diagnostic messages from the ServiceGuard cluster monitor (cmcld) describing the length of the delay.
For this reason, it is recommended that clusters containing nodes running in a VPARS partition increase the NODE_TIMEOUT parameter to fourteen seconds in order to eliminate cluster reformations caused by latency within the VPARS nodes.
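As a sketch of how the recommended value might be written into the ASCII cluster configuration file, the hypothetical helper below converts seconds into a NODE_TIMEOUT entry. It assumes that the file expresses NODE_TIMEOUT in microseconds; confirm the unit against the comments in the configuration template for your ServiceGuard release before using the value.

```shell
#!/bin/sh
# Hypothetical helper: prints a NODE_TIMEOUT entry for the ASCII cluster
# configuration file.
# ASSUMPTION: the file expects the value in microseconds.
# Usage: node_timeout_entry SECONDS
node_timeout_entry() {
    echo "NODE_TIMEOUT $(( $1 * 1000000 ))"
}

node_timeout_entry 2     # the default described earlier
node_timeout_entry 14    # the value recommended for VPARS nodes
```

Because NODE_TIMEOUT is a cluster timing parameter, changing it means halting the cluster and re-applying the configuration.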
A naming convention is a commonly understood pattern that is used to name files and directories. The paragraphs below describe the convention used for most MC/ServiceGuard installations. Actual file names and commands will be noted in bold text.
All MC/ServiceGuard configuration files must reside in the /etc/cmcluster directory. The binary cluster configuration file should be named cmclconfig. The ASCII file that is used to create the cluster is commonly named cmcluster.ascii. The ASCII configuration file must be edited in order to change the basic cluster parameters. When the cluster is recreated using cmapplyconf, the cmclconfig file will be recreated.
For each package that is defined within the cluster, there should be a directory below /etc/cmcluster to contain the package definition and control files. That directory should be named after the package name used in the package configuration file. For example, for 3G EAMS the package configuration file is <package name>.conf and the control file is <package name>.ctl.
The state of the cluster should be verified before, during and after performing all cluster and package activities. Before any cluster or package activities are performed, use the cmviewcl -v command to verify that the cluster is in an appropriate state for the action you intend to perform. During the execution of any package commands, monitor the system and package log files. Use the following commands to monitor package starts and stops and to monitor cluster activity:
Starting a Cluster - When all Systems are UP but all Nodes are Down: Cluster Activity Terminated:
# cmruncl -v
# cmruncl -v -n <node-name-1> - Use only when the cluster is not running.
MC/ServiceGuard cannot guarantee data integrity if you try to start a cluster with the cmruncl -n command while one or more of the nodes of a cluster is already running a cluster.
Adding Nodes to a Cluster - Use the cmrunnode command to add one or more nodes to an already running cluster. Any node you add must already be a part of the cluster configuration. The following example adds node <node-name-2> to the cluster that was previously started with the cmruncl command:
# cmrunnode -v <node-name-2>
Since the cluster is already running, the node joins the cluster and packages may be started on that node. If the node does not find its cluster running, or the node is not part of the cluster configuration, the command fails.
Removing Nodes From a Cluster - To halt a node with a running package, use the -f option. If a package was running that could be switched to an adoptive node, the switch takes place and the package starts on the adoptive node. For example, the following command causes the MC/ServiceGuard daemon running on node <node-name-1> to halt and any package running there to move to <node-name-2>, its adoptive node:
# cmhaltnode -f -v <node-name-1>
Returning a Node to a Cluster - To return a node to the cluster, use cmrunnode.
Reconfiguring the Cluster - To make a permanent change in the cluster configuration, halt the cluster on all nodes only if cluster timing parameters are being changed; all other changes can be made dynamically on a running cluster.
On one node, reconfigure the cluster by editing the cluster definition file:
Use the cmcheckconf command to check the ASCII cluster configuration file. For example:
# cmcheckconf -v -C /etc/cmcluster/cluster.ascii
Use the cmapplyconf command to copy the binary cluster configuration file to all nodes. This file overwrites any previous version of the binary cluster configuration file. For example:
# cmapplyconf -v -C /etc/cmcluster/cluster.ascii
If the cluster was brought down, use the cmruncl command to start the cluster on all nodes or on a subset of nodes, as desired.
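The check-then-apply sequence above can be wrapped so that the configuration is never applied unless verification succeeds. The sketch below is illustrative only: the run and apply_cluster_config functions and the DRY_RUN variable are hypothetical names, and by default the script only prints the commands so that it can be reviewed on a system without ServiceGuard installed. The cmcheckconf and cmapplyconf invocations are the ones shown above.

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper: print or execute a command line.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: verify the ASCII cluster file, then apply it.
# Usage: apply_cluster_config /path/to/cluster.ascii
apply_cluster_config() {
    conf=$1
    # Stop here if verification fails, so a bad file is never applied.
    run cmcheckconf -v -C "$conf" || return 1
    run cmapplyconf -v -C "$conf"
}

apply_cluster_config /etc/cmcluster/cluster.ascii
```

Run it once with the default dry-run setting to confirm the commands and the file path, then set DRY_RUN=0 on the node where the ASCII file was edited.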
Note that this procedure is for cluster changes only. For permanent package modifications, you would Reconfigure a Package. Also note that in order to maintain a package definition in the cluster, there must be appropriate references in the cmcheckconf and cmapplyconf commands:
Halting a Package - You halt an MC/ServiceGuard package when you wish to bring the package out of use but wish the node to continue in operation. Halting a package has a different effect than halting a node. When you halt a node, its packages may switch to adoptive nodes, assuming that package switching is enabled for them. When you halt a package, it is disabled from switching to another node, and must be restarted manually on another node or on the same node. For example, use the cmhaltpkg command to halt a package, as follows:
# cmhaltpkg <package name>
This command halts the package and disables it from switching to another node.
Moving a Package - Before you can move a package, you must halt it on its current node using the cmhaltpkg command. This action not only halts the package, but also disables package switching back to the node on which it is halted.
After you halt the package you must restart it and enable package switching. You can do this by issuing the cmrunpkg command followed by the cmmodpkg command. The cmmodpkg command can also be used with the -n option to enable a package to run on a node if the package has been disabled from running on that node due to some sort of error. If no node is specified, the node from which the command is issued is the implied node. For example:
# cmhaltpkg -n <node-name-2> <package name>
# cmrunpkg -n <node-name-1> <package name>
# cmmodpkg -e <package name>
This procedure is useful when a failover has occurred and you want to push the package back to its primary node.
Reconfiguring a Package - To make a permanent change in package configuration, you must use the following steps:
Halt the package
On the primary node, reconfigure the package by editing the package configuration file: /etc/cmcluster/<database directory>/<package name>.conf
To modify the package control script, edit the package control script directly: /etc/cmcluster/<database directory>/<package name>.ctl. Any changes in service names will also require changes in the package configuration file.
Copy the modified control script to all nodes that can run the package.
Use the cmcheckconf command to check the ASCII cluster configuration file and package configuration file. For example:
NOTE: For package changes that only involve modifications of the package control file, /etc/cmcluster/<database directory>/<package name>.ctl, it is only necessary to halt that package, make the necessary system changes, modify the package control file and distribute it, then restart that package. Changes in the package control file that do not affect the package configuration file include, but are not limited to, changes in the package run and/or halt commands or logical volume changes in existing package volume groups.