HA, SRX Cluster & Redundancy Groups
  • 1. HA, SRX Cluster & Redundancy Groups
  • 2. High Availability Cluster: High-availability clusters (also known as HA clusters or failover clusters) are groups of computers that support server applications that can be reliably utilized with a minimum of downtime. They operate by harnessing redundant computers in groups or clusters that provide continued service when system components fail.
  • 3. Uses of HA Cluster: HA clusters are often used for:
      1. Critical databases
      2. File sharing on a network
      3. Business applications
      4. Customer services such as electronic commerce websites
  • 4. Cluster Monitoring: HA clusters usually use a private heartbeat network connection to monitor the health and status of each node in the cluster.
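
The heartbeat idea on this slide can be sketched in a few lines of Python. This is only an illustration, not how any particular cluster product implements it: the class name `HeartbeatMonitor`, the 3-second timeout, and the node names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer node (hypothetical helper)."""
    timeout: float  # seconds of silence before a node counts as down (assumed value)
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        """Note that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def is_alive(self, node: str, now: float) -> bool:
        """A node is alive if it sent a heartbeat within the timeout window."""
        seen = self.last_seen.get(node)
        return seen is not None and (now - seen) <= self.timeout


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("node0", now=10.0)
monitor.record("node1", now=10.0)
monitor.record("node0", now=12.0)            # node1 stays silent from here on
print(monitor.is_alive("node0", now=14.0))   # True: last heartbeat 2 s ago
print(monitor.is_alive("node1", now=14.0))   # False: last heartbeat 4 s ago
```

Timestamps are passed in explicitly so the logic can be exercised without real clocks or sockets.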
  • 5. Application Design Requirements: In order to run in a high-availability cluster environment, an application must satisfy at least the following technical requirements:
      • There must be a relatively easy way to start, stop, force-stop, and check the status of the application.
      • The application must be able to use shared storage.
  • 6. Cont…
      • The application must be able to restart on another node at the last state before failure, using the saved state from the shared storage.
      • The application must not corrupt data if it crashes or restarts from the saved state.
  • 7. Node Configurations: HA cluster configurations can sometimes be categorized into one of the following models:
      1. Active/active
      2. Active/passive
      3. N+1
      4. N+M
      5. N-to-1
      6. N-to-N
  • 8. Active/active: Traffic intended for the failed node is either passed on to an existing node or load balanced across the remaining nodes. This is usually only possible when the nodes utilize a homogeneous software configuration.
  • 9. Active/passive: Provides a fully redundant instance of each node, which is only brought online when its associated primary node fails. This configuration typically requires the most extra hardware.
  • 10. Node Reliability: HA clusters usually utilize all available techniques to make the individual systems and shared infrastructure as reliable as possible. These include:
      1. Disk mirroring
      2. Redundant network
      3. Redundant storage area network
      4. Redundant electrical power
      5. Redundant power supply units
  • 11. Failover Strategies: Systems that handle failures in distributed computing have different strategies to cure a failure. For instance, an API can define three ways to configure a failover:
      1. FAIL_FAST: The try fails if the first node cannot be reached.
      2. ON_FAIL_TRY_ONE_NEXT_AVAILABLE: Tries one more host before giving up.
      3. ON_FAIL_TRY_ALL_AVAILABLE: Tries all existing nodes before giving up.
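
The three failover policies can be read as limits on how many nodes a client tries before giving up. The Python sketch below works under that reading; it is not the API the slide alludes to. The function names `max_attempts` and `call_with_failover`, the use of `ConnectionError`, and the node names are assumptions for illustration.

```python
from enum import Enum
from typing import Callable, Optional, Sequence


class FailoverPolicy(Enum):
    FAIL_FAST = "FAIL_FAST"
    ON_FAIL_TRY_ONE_NEXT_AVAILABLE = "ON_FAIL_TRY_ONE_NEXT_AVAILABLE"
    ON_FAIL_TRY_ALL_AVAILABLE = "ON_FAIL_TRY_ALL_AVAILABLE"


def max_attempts(policy: FailoverPolicy, node_count: int) -> int:
    """How many nodes may be tried under each policy."""
    if policy is FailoverPolicy.FAIL_FAST:
        return min(1, node_count)          # first node only
    if policy is FailoverPolicy.ON_FAIL_TRY_ONE_NEXT_AVAILABLE:
        return min(2, node_count)          # first node plus one more
    return node_count                      # every known node


def call_with_failover(nodes: Sequence[str],
                       request: Callable[[str], str],
                       policy: FailoverPolicy) -> str:
    """Try `request(node)` on successive nodes, as far as `policy` allows."""
    last_error: Optional[Exception] = None
    for node in nodes[: max_attempts(policy, len(nodes))]:
        try:
            return request(node)
        except ConnectionError as exc:     # assumed failure signal
            last_error = exc
    raise ConnectionError("no node reachable under " + policy.name) from last_error


def fake_request(node: str) -> str:
    """Stand-in for a real call: only node-c answers."""
    if node != "node-c":
        raise ConnectionError(node + " unreachable")
    return "served by " + node


nodes = ["node-a", "node-b", "node-c"]
print(call_with_failover(nodes, fake_request,
                         FailoverPolicy.ON_FAIL_TRY_ALL_AVAILABLE))
# prints "served by node-c"
```

With `FAIL_FAST` or `ON_FAIL_TRY_ONE_NEXT_AVAILABLE` the same call raises `ConnectionError`, because node-c is never reached.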
  • 12. What is SRX Cluster…? SRX Cluster provides network node redundancy by grouping a pair of the same kind of supported SRX Series devices or J Series devices into a cluster. The devices must be running the same version of Junos OS.
  • 13. SRX Cluster Example
  • 14. SRX Plane: The SRX has separated planes. Depending on the SRX platform architecture, the separation varies from being separate processes running on separate cores to completely physically differentiated subsystems.
      1. Control Plane
      2. Data Plane
  • 15. Control Plane: The control plane is used in HA to synchronize the kernel state between the two REs. It also provides a path between the two devices to send hello messages between them. The two devices’ control planes talk to each other over a control link. This link is reserved for control plane communication.
  • 16. Cont… The control plane is always in an active/backup state. This means only one RE can be the master over the cluster’s configuration and state. This ensures that there is only one ultimate truth over the state of the cluster. If the primary RE fails, the secondary takes over for it. Creating an active/active control plane makes synchronization more difficult because many checks would need to be put in place to validate which RE is right.
  • 17. Control Plane States
  • 18. Data Plane: The data plane’s responsibility in the SRX is to pass data and process it based on the administrator’s configuration. All session and service states are maintained on the data plane. The REs and/or control plane are not responsible for maintaining state.
  • 19. Responsibilities of Data Plane: The data plane has a few responsibilities when it comes to HA implementation. First and foremost is state synchronization. The state of sessions and services is shared between the two devices. Sessions are the state of the current set of traffic that is going through the SRX, and services are other items such as:
      1. VPN
      2. IPS
      3. ALGs
  • 20. Chassis Cluster: An SRX cluster implements a concept called chassis cluster. A chassis cluster takes the two SRX devices and represents them as a single device. The interfaces are numbered in such a way that they are counted starting at the first chassis and then end on the second chassis.
  • 21. Chassis Cluster Numbering
  • 22. Chassis Cluster Functionality:
      1. Resilient system architecture, with a single active control plane for the entire cluster and multiple Packet Forwarding Engines. This architecture presents a single device view of the cluster.
      2. Synchronization of configuration and dynamic runtime states between nodes within a cluster.
      3. Monitoring of physical interfaces, and failover if the failure parameters cross a configured threshold.
  • 23. States of Cluster: The different states that a cluster can be in at any given instant are as follows:
      1. Hold
      2. Primary
      3. Secondary-Hold
      4. Secondary
      5. Ineligible
      6. Disabled
    A state transition can be triggered by any event, such as interface monitoring, SPU monitoring, failures, and manual failovers.
  • 24. Chassis Cluster Formation: To form a chassis cluster, a pair of the same kind of supported SRX Series devices or J Series devices is combined to act as a single system that enforces the same overall security. You can deploy up to 15 chassis clusters in a Layer 2 domain.
  • 25. Identification of Clusters: Clusters and nodes are identified in the following way:
      • A cluster is identified by a cluster ID (cluster-id) specified as a number from 1 through 15.
      • A cluster node is identified by a node ID (node) specified as a number from 0 to 1.
  • 26. Redundancy Groups: A redundancy group is an abstract construct that includes and manages a collection of objects. A redundancy group contains objects on both nodes. A redundancy group is primary on one node and backup on the other at any time. We can create up to 128 redundancy groups.
  • 27. Example of Redundancy Groups
  • 28. Primacy of Redundancy Group: Three things determine the primacy of a redundancy group:
      1. The priority configured for the node
      2. The node ID (in case of tied priorities)
      3. The order in which the node comes up
    If a lower priority node comes up first, then it will assume the primacy for a redundancy group (and will stay as primary if preempt is not enabled).
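
The three primacy rules can be sketched as a small election function. This is an illustrative Python model, not Junos code. The names `Node`, `preferred`, and `elect_primary` are hypothetical, and the tie-break direction (the lower node ID wins) is an assumption, since the slide only says the node ID is used when priorities tie.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    node_id: int    # 0 or 1
    priority: int   # higher value is preferred


def preferred(a: Node, b: Node) -> Node:
    """Higher priority wins; on a tie the lower node ID wins (assumption)."""
    if a.priority != b.priority:
        return a if a.priority > b.priority else b
    return a if a.node_id < b.node_id else b


def elect_primary(current: Optional[Node], joining: Node, preempt: bool) -> Node:
    """Decide who is primary when `joining` comes up."""
    if current is None:
        return joining          # first node up takes primacy
    if not preempt:
        return current          # incumbent keeps primacy without preempt
    return preferred(current, joining)


node0 = Node(node_id=0, priority=100)
node1 = Node(node_id=1, priority=200)

primary = elect_primary(None, node0, preempt=False)              # node0 comes up first
print(elect_primary(primary, node1, preempt=False).node_id)      # 0: no preempt
print(elect_primary(primary, node1, preempt=True).node_id)       # 1: preempt enabled
```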
  • 29. Redundancy Group Monitoring: A redundancy group can automatically fail over to another node. To do so, it monitors the following components of the chassis cluster:
      1. Interface Monitoring
      2. IP Address Monitoring
      3. Monitoring of Global-Level Objects
          i. SPU Monitoring
          ii. Flowd Monitoring
          iii. Cold-Sync Monitoring
  • 30. Chassis Cluster Redundancy Group Failover: A redundancy group is a collection of objects that fail over as a group. Each redundancy group monitors a set of objects (physical interfaces), and each monitored object is assigned a weight. Each redundancy group has an initial threshold of 255.
  • 31. Cont… When a monitored object fails, the weight of the object is subtracted from the threshold value of the redundancy group. When the threshold value reaches zero, the redundancy group fails over to the other node. As a result, all the objects associated with the redundancy group fail over as well.
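
The weight and threshold arithmetic from the two slides above can be sketched as follows. This is a minimal Python model under stated assumptions: the class `RedundancyGroup`, the interface names, and the weights of 128 are chosen for the example, and the threshold is clamped at zero so that overshooting still counts as "reaches zero".

```python
from dataclasses import dataclass, field
from typing import Dict, Set

INITIAL_THRESHOLD = 255  # initial threshold stated on the slide


@dataclass
class RedundancyGroup:
    weights: Dict[str, int]                       # monitored object -> weight
    failed: Set[str] = field(default_factory=set)

    def remaining_threshold(self) -> int:
        """Initial threshold minus the weights of all failed objects, floored at 0."""
        lost = sum(self.weights[name] for name in self.failed)
        return max(0, INITIAL_THRESHOLD - lost)

    def mark_failed(self, name: str) -> bool:
        """Record a failure; return True if the group should now fail over."""
        if name not in self.weights:
            raise KeyError(name)
        self.failed.add(name)
        return self.remaining_threshold() == 0


# Hypothetical monitored interfaces, each weighted 128.
rg1 = RedundancyGroup(weights={"ge-0/0/1": 128, "ge-0/0/2": 128})
print(rg1.mark_failed("ge-0/0/1"), rg1.remaining_threshold())  # False 127
print(rg1.mark_failed("ge-0/0/2"), rg1.remaining_threshold())  # True 0
```

With these weights a single interface failure leaves the group in place, and the second failure triggers the failover. A weight of 255 on one object would make that object alone sufficient.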
  • 32. Cont… Because back-to-back redundancy group failovers that occur too quickly can cause a cluster to exhibit unpredictable behavior, a dampening time between failovers is needed. The default dampening time is 300 seconds (5 minutes) for redundancy group 0 and is configurable up to 1800 seconds with the hold-down-interval statement.
  • 33. Cont… Redundancy groups x (redundancy groups numbered 1 through 128) have a default dampening time of 1 second, with a range of 0 through 1800 seconds. The hold-down interval affects manual failovers, as well as automatic failovers associated with monitoring failures.
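
The dampening behaviour can be sketched as a simple time check. The Python below is an illustration only: `default_hold_down` and `failover_allowed` are hypothetical helpers, and only the 0 to 1800 second bound is validated because the slides do not state a lower bound for redundancy group 0.

```python
from typing import Optional

MAX_HOLD_DOWN = 1800  # seconds, upper bound stated on the slides


def default_hold_down(rg_id: int) -> int:
    """Default dampening time: 300 s for redundancy group 0, 1 s for the others."""
    return 300 if rg_id == 0 else 1


def failover_allowed(last_failover_at: Optional[float],
                     now: float,
                     hold_down: int) -> bool:
    """True if enough time has passed since the previous failover."""
    if not 0 <= hold_down <= MAX_HOLD_DOWN:
        raise ValueError("hold-down interval must be within 0..1800 seconds")
    if last_failover_at is None:
        return True                 # no earlier failover to dampen against
    return now - last_failover_at >= hold_down


# 150 s after a failover at t=100:
print(failover_allowed(100.0, 250.0, default_hold_down(0)))  # False: 150 < 300
print(failover_allowed(100.0, 250.0, default_hold_down(1)))  # True: 150 >= 1
```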
  • 34. Chassis Cluster Redundancy Group Manual Failover: We can initiate a redundancy group x failover manually. A manual failover applies until a failback event occurs. You can also initiate a redundancy group 0 failover manually if you want to change the primary node for redundancy group 0.
  • 35. State Transitions Cases: There are three transition cases:
      1. Reboot case: The node in the secondary-hold state transitions to the primary state; the other node goes dead (inactive).
  • 36. Cont…
      2. Control link failure case: The node in the secondary-hold state transitions to the ineligible state and then to a disabled state; the other node transitions to the primary state.
      3. Fabric link failure case: The node in the secondary-hold state transitions directly to the disabled state.
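
The three transition cases can be written down as a lookup table. The Python below is a sketch that only restates the slides; the case labels (`"reboot"`, `"control-link-failure"`, `"fabric-link-failure"`) and the helper `final_state` are names invented for this example.

```python
from enum import Enum
from typing import Dict, List


class State(Enum):
    HOLD = "hold"
    PRIMARY = "primary"
    SECONDARY_HOLD = "secondary-hold"
    SECONDARY = "secondary"
    INELIGIBLE = "ineligible"
    DISABLED = "disabled"


# States visited, in order, by the node that starts in secondary-hold.
SECONDARY_HOLD_PATHS: Dict[str, List[State]] = {
    "reboot": [State.PRIMARY],
    "control-link-failure": [State.INELIGIBLE, State.DISABLED],
    "fabric-link-failure": [State.DISABLED],
}


def final_state(case: str) -> State:
    """State in which the secondary-hold node ends up for the given case."""
    return SECONDARY_HOLD_PATHS[case][-1]


for case, path in SECONDARY_HOLD_PATHS.items():
    steps = " -> ".join(s.value for s in [State.SECONDARY_HOLD] + path)
    print(case + ": " + steps)
```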
  • 37. SNMP Failover Traps: Chassis clustering supports SNMP traps, which are triggered whenever there is a redundancy group failover. The trap message can help you troubleshoot failovers. It contains the following information:
      1. The cluster ID and node ID
      2. The reason for the failover
      3. The redundancy group that is involved in the failover
      4. The redundancy group’s previous state and current state
  • 38. Chassis Cluster Interfaces: A network device doesn’t help a network without participating in traffic processing. An SRX has two different interface types that it can use to process traffic:
      1. Reth Interface
      2. Local Interface
  • 39. Reth Interface: A Reth is a Junos aggregate Ethernet interface with special properties compared to a traditional aggregate Ethernet interface. The Reth allows the administrator to add one or more child links per chassis.
  • 40. Reth MAC Address: The MAC address for the Reth is based on a combination of the cluster ID and the Reth number.
  • 41. Cont… In the figure, the first four of the six bytes are fixed. They do not change between cluster deployments. The last two bytes vary based on the cluster ID and the Reth index.
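
Since the figure is not reproduced in this transcript, the sketch below shows one plausible way a six-byte address could be assembled from a fixed four-byte prefix plus the cluster ID and Reth index. Everything specific here is an assumption for illustration and should be checked against Juniper's documentation: the prefix value `00:10:db:ff`, the placement of the cluster ID in the upper four bits of the fifth byte, the use of the sixth byte for the Reth index, and the function name `reth_mac`.

```python
FIXED_PREFIX = (0x00, 0x10, 0xDB, 0xFF)  # assumed fixed first four bytes


def reth_mac(cluster_id: int, reth_index: int) -> str:
    """Build an illustrative Reth MAC from the cluster ID and Reth index."""
    if not 1 <= cluster_id <= 15:
        raise ValueError("cluster ID must be 1 through 15")
    if not 0 <= reth_index <= 255:
        raise ValueError("Reth index must fit in one byte")
    fifth = cluster_id << 4   # assumed layout: cluster ID in the high nibble
    sixth = reth_index        # assumed layout: Reth index in the last byte
    return ":".join(f"{b:02x}" for b in (*FIXED_PREFIX, fifth, sixth))


print(reth_mac(cluster_id=1, reth_index=0))   # 00:10:db:ff:10:00
print(reth_mac(cluster_id=15, reth_index=3))  # 00:10:db:ff:f0:03
```

Under this layout, two clusters in the same Layer 2 domain never share a Reth MAC as long as their cluster IDs differ, which is consistent with the limit of 15 clusters per domain mentioned earlier.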
  • 42. Local Interface: A local interface is an interface that is configured local to a specific node. This method of configuration on an interface is the same method of configuration on a standalone device.
  • 43. Cont… The significance of a local interface in an SRX cluster is that it does not have a backup interface on the other chassis, meaning that it is part of neither a Reth nor a redundancy group. If this interface were to fail, its IP address would not fail over to the other node.
  • 44. Troubleshooting the Cluster: There are various methods that show the administrator how to troubleshoot a chassis cluster:
      1. Identify the Cluster Status
      2. Checking Interfaces
      3. Verifying the Data Plane
      4. Core Dumps
      5. The Dreaded Priority Zero