High Availability Cluster
High-availability clusters (also known as HA clusters or failover clusters) are groups of computers that support server applications that can be reliably utilized with a minimum of downtime. They operate by harnessing redundant computers in groups or clusters that provide continued service when system components fail.
Uses of HA Clusters
HA clusters are often used for:
1. Critical databases
2. File sharing on a network
3. Business applications
4. Customer services such as electronic commerce websites
Cluster Monitoring
HA clusters usually use a private heartbeat network connection to monitor the health and status of each node in the cluster.
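Heartbeat monitoring can be sketched as a table of "last heard from" timestamps per node. The example below is a minimal illustration only: the class name, the timeout value, and the node names are assumptions, and a real cluster would receive heartbeats over the private network link rather than through direct method calls.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each node (hypothetical helper)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def record(self, node: str, now: Optional[float] = None) -> None:
        # Store the arrival time of the latest heartbeat from this node.
        self.last_seen[node] = time.monotonic() if now is None else now

    def failed_nodes(self, now: Optional[float] = None) -> List[str]:
        # A node is suspected failed when its last heartbeat is older
        # than the timeout.
        now = time.monotonic() if now is None else now
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

Passing `now` explicitly keeps the sketch deterministic; in normal use it can be omitted so the monotonic clock is used.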
Application Design Requirements
In order to run in a high-availability cluster environment, an application must satisfy at least the following technical requirements:
1. There must be a relatively easy way to start, stop, force-stop, and check the status of the application.
2. The application must be able to use shared storage.
Application Design Requirements (continued)
3. The application must be able to restart on another node at the last state before failure, using the saved state from the shared storage.
4. The application must not corrupt data if it crashes or restarts from the saved state.
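One common way to meet the "do not corrupt data on a crash" requirement is to write saved state atomically: write to a temporary file, flush it to disk, then rename it over the old copy. The sketch below assumes the shared storage is a mounted file system with atomic rename semantics; the function names and the JSON format are illustrative choices, not part of any particular cluster product.

```python
import json
import os
import tempfile


def save_state(path: str, state: dict) -> None:
    """Atomically replace the saved state at `path` (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final
    # rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # A reader sees either the old file or the new one, never a
        # partially written file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path: str) -> dict:
    """Read the last saved state, for example after restarting on another node."""
    with open(path) as f:
        return json.load(f)
```

If the process dies before the rename, the previous state file is still intact and the restarted instance resumes from it.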
Node Configurations
Cluster configurations can sometimes be categorized into one of the following models:
1. Active/active
2. Active/passive
3. N+1
4. N+M
5. N-to-1
6. N-to-N
Active/active
Traffic intended for the failed node is either passed on to an existing node or load balanced across the remaining nodes. This is usually only possible when the nodes utilize a homogeneous software configuration.
Active/passive
Provides a fully redundant instance of each node, which is only brought online when its associated primary node fails. This configuration typically requires the most extra hardware.
Node Reliability
HA clusters usually utilize all available techniques to make the individual systems and shared infrastructure as reliable as possible. These include:
1. Disk mirroring
2. Redundant network
3. Redundant storage area network
4. Redundant electrical power
5. Redundant power supply units
Failover Strategies
Systems that handle failures in distributed computing have different strategies to cure a failure. For instance, the Apache Cassandra API Hector defines three ways to configure a failover:
1. FAIL_FAST: The try fails if the first node cannot be reached.
2. ON_FAIL_TRY_ONE_NEXT_AVAILABLE: Tries one more host before giving up.
3. ON_FAIL_TRY_ALL_AVAILABLE: Tries all existing nodes before giving up.
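The three policies differ only in how many hosts are tried before giving up. The following Python sketch illustrates that idea; it is not the API's actual code. The function names, the use of `ConnectionError` to signal an unreachable host, and the host list are all assumptions made for the example.

```python
from enum import Enum
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class FailoverPolicy(Enum):
    FAIL_FAST = 1
    ON_FAIL_TRY_ONE_NEXT_AVAILABLE = 2
    ON_FAIL_TRY_ALL_AVAILABLE = 3


def attempts_allowed(policy: FailoverPolicy, host_count: int) -> int:
    """How many hosts may be tried under each policy."""
    if policy is FailoverPolicy.FAIL_FAST:
        return 1
    if policy is FailoverPolicy.ON_FAIL_TRY_ONE_NEXT_AVAILABLE:
        return min(2, host_count)
    return host_count


def call_with_failover(
    hosts: Sequence[str],
    operation: Callable[[str], T],
    policy: FailoverPolicy,
) -> T:
    """Run `operation` against hosts in order until one succeeds."""
    if not hosts:
        raise ValueError("at least one host is required")
    last_error: Optional[Exception] = None
    for host in hosts[: attempts_allowed(policy, len(hosts))]:
        try:
            return operation(host)
        except ConnectionError as err:
            # Remember the failure and move on to the next allowed host.
            last_error = err
    raise ConnectionError("all attempted hosts failed") from last_error
```

With hosts `["a", "b", "c"]` and only `"c"` reachable, `FAIL_FAST` and `ON_FAIL_TRY_ONE_NEXT_AVAILABLE` both raise, while `ON_FAIL_TRY_ALL_AVAILABLE` reaches `"c"`.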
What Is an SRX Cluster?
An SRX cluster provides network node redundancy by grouping a pair of the same kind of supported SRX Series devices or J Series devices into a cluster. The devices must be running the same version of Junos OS.
SRX Planes
The SRX has separate planes. Depending on the SRX platform architecture, the separation varies from separate processes running on separate cores to completely physically differentiated subsystems.
1. Control plane
2. Data plane
Control Plane
The control plane is used in HA to synchronize the kernel state between the two REs. It also provides a path between the two devices for exchanging hello messages. The two devices' control planes talk to each other over a control link. This link is reserved for control plane communication.
Control Plane (continued)
The control plane is always in an active/backup state. This means only one RE can be the master over the cluster's configuration and state, which ensures that there is only one ultimate truth about the state of the cluster. If the primary RE fails, the secondary takes over for it. An active/active control plane would make synchronization more difficult, because many checks would need to be put in place to validate which RE is right.
Data Plane
The data plane's responsibility in the SRX is to pass and process data based on the administrator's configuration. All session and service states are maintained on the data plane. The REs and the control plane are not responsible for maintaining state.
Responsibilities of the Data Plane
The data plane has a few responsibilities when it comes to HA implementation. First and foremost is state synchronization. The state of sessions and services is shared between the two devices. Sessions are the state of the current set of traffic that is going through the SRX, and services are other items such as:
1. VPN
2. IPS
3. ALGs
Chassis Cluster
An SRX cluster implements a concept called a chassis cluster. A chassis cluster takes the two SRX devices and represents them as a single device. The interfaces are numbered in such a way that they are counted starting at the first chassis and ending on the second chassis.
Chassis Cluster Functionality
1. Resilient system architecture, with a single active control plane for the entire cluster and multiple Packet Forwarding Engines. This architecture presents a single device view of the cluster.
2. Synchronization of configuration and dynamic runtime states between nodes within a cluster.
3. Monitoring of physical interfaces, and failover if the failure parameters cross a configured threshold.
States of a Cluster
The different states that a cluster can be in at any given instant are as follows:
1. Hold
2. Primary
3. Secondary-Hold
4. Secondary
5. Ineligible
6. Disabled
A state transition can be triggered by any event, such as interface monitoring, SPU monitoring, failures, and manual failovers.
Chassis Cluster Formation
To form a chassis cluster, a pair of the same kind of supported SRX Series devices or J Series devices is combined to act as a single system that enforces the same overall security. You can deploy up to 15 chassis clusters in a Layer 2 domain.
Identification of Clusters
Clusters and nodes are identified in the following way:
1. A cluster is identified by a cluster ID (cluster-id) specified as a number from 1 through 15.
2. A cluster node is identified by a node ID (node) specified as a number from 0 to 1.
Redundancy Groups
A redundancy group is an abstract construct that includes and manages a collection of objects. A redundancy group contains objects on both nodes. At any time, a redundancy group is primary on one node and backup on the other. We can create up to 128 redundancy groups.
Primacy of a Redundancy Group
Three things determine the primacy of a redundancy group:
1. The priority configured for the node
2. The node ID (in case of tied priorities)
3. The order in which the nodes come up
If a lower priority node comes up first, it assumes primacy for the redundancy group (and stays primary if preempt is not enabled).
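The interaction of priority, node ID, boot order, and preempt can be sketched as a small election function. This is an illustration under stated assumptions, not Junos code: it assumes that a higher priority value wins and that the lower node ID wins a tie, and the names `Node` and `elect_primary` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Node:
    node_id: int
    priority: int


def elect_primary(
    up_nodes: Sequence[Node],
    current_primary: Optional[Node],
    preempt: bool,
) -> Node:
    """Pick the primary node for one redundancy group."""
    if not up_nodes:
        raise ValueError("no node is up")
    # Without preempt, a node that is already primary and still up
    # keeps the role, even if a higher priority node has joined.
    if current_primary is not None and current_primary in up_nodes and not preempt:
        return current_primary
    # Assumed ordering: higher priority first, then lower node ID.
    return max(up_nodes, key=lambda n: (n.priority, -n.node_id))
```

For example, if node 0 (priority 100) boots first it becomes primary; when node 1 (priority 200) joins, node 0 stays primary unless `preempt` is true.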
Redundancy Group Monitoring
For a redundancy group to fail over automatically to the other node, it must monitor the following components of the chassis cluster:
1. Interface monitoring
2. IP address monitoring
3. Monitoring of global-level objects
   i. SPU monitoring
   ii. Flowd monitoring
   iii. Cold-sync monitoring
Chassis Cluster Redundancy Group Failover
A redundancy group is a collection of objects that fail over as a group. Each redundancy group monitors a set of objects (physical interfaces), and each monitored object is assigned a weight. Each redundancy group has an initial threshold of 255.
Chassis Cluster Redundancy Group Failover (continued)
When a monitored object fails, the weight of the object is subtracted from the threshold value of the redundancy group. When the threshold value reaches zero, the redundancy group fails over to the other node. As a result, all the objects associated with the redundancy group fail over as well.
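The weight arithmetic is easy to check by hand. The sketch below starts from the threshold of 255 and subtracts the weight of each failed object; the interface names and weights are made up for the example, and treating any value at or below zero as "reached zero" is an assumption of this sketch.

```python
from typing import Dict, Iterable

INITIAL_THRESHOLD = 255


def remaining_threshold(weights: Dict[str, int], failed: Iterable[str]) -> int:
    """Threshold left after subtracting the weight of each failed object."""
    # set() ensures that an object reported twice is only counted once.
    return INITIAL_THRESHOLD - sum(weights[name] for name in set(failed))


def should_fail_over(weights: Dict[str, int], failed: Iterable[str]) -> bool:
    """True once the failed weights have used up the whole threshold."""
    return remaining_threshold(weights, failed) <= 0


# Hypothetical monitored interfaces and their configured weights.
weights = {"ge-0/0/1": 255, "ge-0/0/2": 128, "ge-0/0/3": 128}
```

With these weights, losing `ge-0/0/2` alone leaves 127 and does not trigger a failover, losing both `ge-0/0/2` and `ge-0/0/3` does, and losing `ge-0/0/1` alone does as well.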
Chassis Cluster Redundancy Group Failover (continued)
Because back-to-back redundancy group failovers that occur too quickly can cause a cluster to exhibit unpredictable behavior, a dampening time between failovers is needed. The default dampening time for redundancy group 0 is 300 seconds (5 minutes), and it is configurable up to 1800 seconds with the hold-down-interval statement.
Chassis Cluster Redundancy Group Failover (continued)
Redundancy groups x (redundancy groups numbered 1 through 128) have a default dampening time of 1 second, with a range of 0 through 1800 seconds. The hold-down interval affects manual failovers, as well as automatic failovers associated with monitoring failures.
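The dampening behaviour can be pictured as a timer that refuses a new failover until the hold-down interval has passed since the previous one. The class below is a simplified sketch: the name `HoldDownTimer` is hypothetical, times are plain seconds supplied by the caller, and what the device does with a refused failover (drop it or defer it) is not modeled.

```python
from typing import Optional

MAX_HOLD_DOWN_S = 1800  # upper limit quoted in the text


class HoldDownTimer:
    """Rejects failovers that follow the previous one too closely."""

    def __init__(self, hold_down_s: float) -> None:
        if not 0 <= hold_down_s <= MAX_HOLD_DOWN_S:
            raise ValueError("hold-down interval must be 0 through 1800 seconds")
        self.hold_down_s = hold_down_s
        self.last_failover_at: Optional[float] = None

    def try_failover(self, now: float) -> bool:
        """Return True and record the time if a failover is allowed at `now`."""
        if (
            self.last_failover_at is not None
            and now - self.last_failover_at < self.hold_down_s
        ):
            return False
        self.last_failover_at = now
        return True


rg0 = HoldDownTimer(hold_down_s=300)  # default quoted for redundancy group 0
rgx = HoldDownTimer(hold_down_s=1)    # default quoted for redundancy groups x
```

With the 300-second setting, a failover at t=0 is accepted, a second one at t=100 is refused, and one at t=300 is accepted again.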
Chassis Cluster Redundancy Group Manual Failover
We can initiate a redundancy group x failover manually. A manual failover applies until a failback event occurs. You can also initiate a redundancy group 0 failover manually if you want to change the primary node for redundancy group 0.
State Transition Cases
There are three transition cases:
1. Reboot case: The node in the secondary-hold state transitions to the primary state; the other node goes dead (inactive).
State Transition Cases (continued)
2. Control link failure case: The node in the secondary-hold state transitions to the ineligible state and then to the disabled state; the other node transitions to the primary state.
3. Fabric link failure case: The node in the secondary-hold state transitions directly to the disabled state.
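The three cases can be written down as a lookup table, which makes it easy to compare where the secondary-hold node ends up in each one. This is only a restatement of the cases listed here in Python form; the case keys and the function name are hypothetical, and a peer state of `None` means the text does not say what happens to the other node.

```python
from typing import Dict, List, NamedTuple, Optional


class Transition(NamedTuple):
    path: List[str]            # states visited by the secondary-hold node
    peer_state: Optional[str]  # resulting state of the other node, if stated


TRANSITIONS: Dict[str, Transition] = {
    "reboot": Transition(["secondary-hold", "primary"], "dead (inactive)"),
    "control-link-failure": Transition(
        ["secondary-hold", "ineligible", "disabled"], "primary"
    ),
    "fabric-link-failure": Transition(["secondary-hold", "disabled"], None),
}


def final_state(case: str) -> str:
    """State in which the secondary-hold node ends up for a given case."""
    return TRANSITIONS[case].path[-1]
```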
SNMP Failover Traps
Chassis clustering supports SNMP traps, which are triggered whenever there is a redundancy group failover. The trap message can help you troubleshoot failovers. It contains the following information:
1. The cluster ID and node ID
2. The reason for the failover
3. The redundancy group that is involved in the failover
4. The redundancy group's previous state and current state
Chassis Cluster Interfaces
A network device doesn't help a network without participating in traffic processing. An SRX has two interface types that it can use to process traffic:
1. Reth interface
2. Local interface
Reth Interface A Reth is a Junos aggregate Ethernet interface and it has special properties compared to a traditional aggregate Ethernet interface. The Reth allows the administrator to add one or more child links per
Reth MAC Address
The MAC address for the Reth is based on a combination of the cluster ID and the Reth number.
Reth MAC Address (continued)
In the figure, the first four of the six bytes are fixed. They do not change between cluster deployments. The last two bytes vary based on the cluster ID and the Reth index.
Local Interface
A local interface is an interface that is configured local to a specific node. It is configured in the same way as an interface on a standalone device.
Local Interface (continued)
The significance of a local interface in an SRX cluster is that it does not have a backup interface on the other chassis, meaning that it is part of neither a Reth nor a redundancy group. If this interface were to fail, its IP address would not fail over to the other node.
Troubleshooting the Cluster
There are several areas an administrator can check when troubleshooting a chassis cluster:
1. Identifying the cluster status
2. Checking interfaces
3. Verifying the data plane
4. Core dumps
5. The dreaded priority zero