HA, SRX Cluster & Redundancy Groups
  1. HA, SRX Cluster & Redundancy Groups
  2. High Availability Cluster
     High-availability clusters (also known as HA clusters or failover clusters) are groups of computers that support server applications that can be reliably utilized with a minimum of downtime. They operate by harnessing redundant computers in groups or clusters that provide continued service when system components fail.
  3. Uses of HA Cluster
     HA clusters are often used for:
     1. Critical databases
     2. File sharing on a network
     3. Business applications
     4. Customer services such as electronic commerce websites
  4. Cluster Monitoring
     HA clusters usually use a private heartbeat network connection to monitor the health and status of each node in the cluster.
  5. Application Design Requirements
     To run in a high-availability cluster environment, an application must satisfy at least the following technical requirements:
     • There must be a relatively easy way to start, stop, force-stop, and check the status of the application.
     • The application must be able to use shared storage.
  6. Application Design Requirements (cont.)
     • The application must be able to restart on another node at the last state before failure, using the saved state from the shared storage.
     • The application must not corrupt data if it crashes or restarts from the saved state.
  7. Node Configurations
     HA cluster configurations can sometimes be categorized into one of the following models:
     1. Active/active
     2. Active/passive
     3. N+1
     4. N+M
     5. N-to-1
     6. N-to-N
  8. Active/active
     Traffic intended for the failed node is either passed on to an existing node or load balanced across the remaining nodes. This is usually only possible when the nodes use a homogeneous software configuration.
  9. Active/passive
     Provides a fully redundant instance of each node, which is brought online only when its associated primary node fails. This configuration typically requires the most extra hardware.
  10. Node Reliability
     HA clusters usually use all available techniques to make the individual systems and the shared infrastructure as reliable as possible. These include:
     1. Disk mirroring
     2. Redundant network
     3. Redundant storage area network
     4. Redundant electrical power
     5. Redundant power supply units
  11. Failover Strategies
     Systems that handle failures in distributed computing use different strategies to recover from a failure. For instance, one API defines three ways to configure a failover:
     1. FAIL_FAST: The try fails if the first node cannot be reached.
     2. ON_FAIL_TRY_ONE_NEXT_AVAILABLE: Tries one more host before giving up.
     3. ON_FAIL_TRY_ALL_AVAILABLE: Tries all existing nodes before giving up.
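The three failover policies can be illustrated with a minimal Python sketch. It is not the API the slide refers to: `FailoverPolicy`, `max_attempts`, and `call_with_failover` are hypothetical names, and a node is assumed to signal that it is unreachable by raising `ConnectionError`.

```python
from enum import Enum
from typing import Callable, Sequence


class FailoverPolicy(Enum):
    """Hypothetical enum mirroring the three policy names on the slide."""
    FAIL_FAST = "fail_fast"
    ON_FAIL_TRY_ONE_NEXT_AVAILABLE = "try_one_next"
    ON_FAIL_TRY_ALL_AVAILABLE = "try_all"


def max_attempts(policy: FailoverPolicy, node_count: int) -> int:
    """How many nodes a request may be tried against under each policy."""
    if policy is FailoverPolicy.FAIL_FAST:
        return min(1, node_count)
    if policy is FailoverPolicy.ON_FAIL_TRY_ONE_NEXT_AVAILABLE:
        return min(2, node_count)
    return node_count  # ON_FAIL_TRY_ALL_AVAILABLE


def call_with_failover(
    nodes: Sequence[str],
    request: Callable[[str], str],
    policy: FailoverPolicy,
) -> str:
    """Try the request on successive nodes until one answers or the policy gives up."""
    last_error = None
    for node in nodes[: max_attempts(policy, len(nodes))]:
        try:
            return request(node)
        except ConnectionError as exc:  # assumption: unreachable nodes raise this
            last_error = exc
    raise ConnectionError(f"no node answered under {policy.name}") from last_error
```

With three nodes of which only the last is reachable, `FAIL_FAST` and `ON_FAIL_TRY_ONE_NEXT_AVAILABLE` both give up, while `ON_FAIL_TRY_ALL_AVAILABLE` reaches the third node.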
  12. What is SRX Cluster?
     An SRX cluster provides network node redundancy by grouping a pair of the same kind of supported SRX Series devices or J Series devices into a cluster. The devices must be running the same version of Junos OS.
  13. SRX Cluster Example
  14. SRX Plane
     The SRX has separated planes. Depending on the SRX platform architecture, the separation varies from separate processes running on separate cores to completely physically differentiated subsystems.
     1. Control plane
     2. Data plane
  15. Control Plane
     The control plane is used in HA to synchronize the kernel state between the two REs. It also provides a path between the two devices for exchanging hello messages. The two devices' control planes talk to each other over a control link, which is reserved for control plane communication.
  16. Control Plane (cont.)
     The control plane is always in an active/backup state. This means only one RE can be the master over the cluster's configuration and state, which ensures that there is only one ultimate truth about the state of the cluster. If the primary RE fails, the secondary takes over for it. An active/active control plane would make synchronization more difficult, because many checks would need to be put in place to validate which RE is right.
  17. Control Plane States
  18. Data Plane
     The data plane's responsibility in the SRX is to pass data and process it based on the administrator's configuration. All session and service states are maintained on the data plane. The REs and the control plane are not responsible for maintaining state.
  19. Responsibilities of Data Plane
     The data plane has a few responsibilities in an HA implementation. First and foremost is state synchronization: the state of sessions and services is shared between the two devices. Sessions are the state of the current set of traffic that is going through the SRX, and services are other items such as:
     1. VPN
     2. IPS
     3. ALGs
  20. Chassis Cluster
     An SRX cluster implements a concept called chassis cluster. A chassis cluster takes the two SRX devices and represents them as a single device. The interfaces are numbered so that the count starts on the first chassis and ends on the second chassis.
  21. Chassis Cluster Numbering
  22. Chassis Cluster Functionality
     1. Resilient system architecture, with a single active control plane for the entire cluster and multiple Packet Forwarding Engines. This architecture presents a single-device view of the cluster.
     2. Synchronization of configuration and dynamic runtime states between nodes within a cluster.
     3. Monitoring of physical interfaces, and failover if the failure parameters cross a configured threshold.
  23. States of Cluster
     A cluster can be in one of the following states at any given instant:
     1. Hold
     2. Primary
     3. Secondary-Hold
     4. Secondary
     5. Ineligible
     6. Disabled
     A state transition can be triggered by any event, such as interface monitoring, SPU monitoring, failures, and manual failovers.
  24. Chassis Cluster Formation
     To form a chassis cluster, a pair of the same kind of supported SRX Series devices or J Series devices is combined to act as a single system that enforces the same overall security. You can deploy up to 15 chassis clusters in a Layer 2 domain.
  25. Identification of Clusters
     Clusters and nodes are identified in the following way:
     • A cluster is identified by a cluster ID (cluster-id), specified as a number from 1 through 15.
     • A cluster node is identified by a node ID (node), specified as 0 or 1.
  26. Redundancy Groups
     A redundancy group is an abstract construct that includes and manages a collection of objects. A redundancy group contains objects on both nodes, and at any time it is primary on one node and backup on the other. We can create up to 128 redundancy groups.
  27. Example of Redundancy Groups
  28. Primacy of Redundancy Group
     Three things determine the primacy of a redundancy group:
     1. The priority configured for the node
     2. The node ID (in case of tied priorities)
     3. The order in which the nodes come up
     If a lower-priority node comes up first, it assumes primacy for the redundancy group and stays primary if preempt is not enabled.
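The primacy rules on this slide can be sketched as a small election function. This is an illustrative Python model, not Junos code: `Node` and `elect_primary` are hypothetical names, and two behaviors are assumptions rather than statements from the slides: that the higher priority value wins, and that the lower node ID wins a priority tie.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Node:
    node_id: int   # 0 or 1 in a chassis cluster
    priority: int  # configured per node for the redundancy group


def elect_primary(
    up_nodes: Sequence[Node],
    current_primary: Optional[Node] = None,
    preempt: bool = False,
) -> Optional[Node]:
    """Pick the primary node for one redundancy group.

    Assumptions (not stated on the slides): higher priority wins,
    and the lower node ID breaks a priority tie.
    """
    if not up_nodes:
        return None
    # A node that is already primary keeps the role unless preempt is enabled.
    if current_primary is not None and current_primary in up_nodes and not preempt:
        return current_primary
    return min(up_nodes, key=lambda n: (-n.priority, n.node_id))
```

If node 0 (priority 100) boots first, it is the only candidate and becomes primary. When node 1 (priority 200) joins later, node 0 keeps the role unless `preempt=True`, which matches the slide's note about a lower-priority node that comes up first.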
  29. Redundancy Group Monitoring
     A redundancy group can fail over to the other node automatically. To do so, it monitors the following components of the chassis cluster:
     1. Interface monitoring
     2. IP address monitoring
     3. Monitoring of global-level objects
        i. SPU monitoring
        ii. Flowd monitoring
        iii. Cold-sync monitoring
  30. Chassis Cluster Redundancy Group Failover
     A redundancy group is a collection of objects that fail over as a group. Each redundancy group monitors a set of objects (physical interfaces), and each monitored object is assigned a weight. Each redundancy group has an initial threshold of 255.
  31. Chassis Cluster Redundancy Group Failover (cont.)
     When a monitored object fails, the weight of the object is subtracted from the threshold value of the redundancy group. When the threshold value reaches zero, the redundancy group fails over to the other node. As a result, all the objects associated with the redundancy group fail over as well.
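The threshold arithmetic described above can be worked through in a few lines of Python. This is a sketch of the rule as stated on the slides only; the function names and the interface names and weights in the example are hypothetical.

```python
from typing import AbstractSet, Mapping

INITIAL_THRESHOLD = 255  # initial threshold of every redundancy group, per the slides


def remaining_threshold(weights: Mapping[str, int], failed: AbstractSet[str]) -> int:
    """Subtract the weight of every failed monitored object from the threshold."""
    lost = sum(weight for name, weight in weights.items() if name in failed)
    return max(0, INITIAL_THRESHOLD - lost)


def should_fail_over(weights: Mapping[str, int], failed: AbstractSet[str]) -> bool:
    """The redundancy group fails over once the threshold reaches zero."""
    return remaining_threshold(weights, failed) == 0


# Hypothetical monitored interfaces and weights, for illustration only.
example_weights = {"ge-0/0/1": 255, "ge-0/0/2": 128, "ge-0/0/3": 128}
```

With these example weights, losing `ge-0/0/2` alone leaves a threshold of 127 and no failover. Losing both `ge-0/0/2` and `ge-0/0/3` subtracts 256, so the group fails over. Losing `ge-0/0/1` by itself is enough because its weight equals the full threshold.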
  32. Chassis Cluster Redundancy Group Failover (cont.)
     Because back-to-back redundancy group failovers that occur too quickly can cause a cluster to exhibit unpredictable behavior, a dampening time between failovers is needed. The default dampening time for redundancy group 0 is 300 seconds (5 minutes) and is configurable up to 1800 seconds with the hold-down-interval statement.
  33. Chassis Cluster Redundancy Group Failover (cont.)
     Redundancy groups x (redundancy groups numbered 1 through 128) have a default dampening time of 1 second, with a range of 0 through 1800 seconds. The hold-down interval affects manual failovers as well as automatic failovers associated with monitoring failures.
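The effect of the hold-down interval can be sketched as a simple time check. This Python model uses only the defaults and the 1800-second upper bound quoted on the slides; the function names are hypothetical, and timestamps are assumed to be plain seconds.

```python
from typing import Optional

MAX_HOLD_DOWN_SECONDS = 1800  # upper bound quoted on the slides


def default_hold_down(rg_id: int) -> int:
    """Default dampening time: 300 s for redundancy group 0, 1 s for groups 1 and up."""
    return 300 if rg_id == 0 else 1


def failover_allowed(
    rg_id: int,
    last_failover_at: Optional[float],
    now: float,
    hold_down: Optional[int] = None,
) -> bool:
    """Return True if enough time has passed since the group's previous failover."""
    interval = default_hold_down(rg_id) if hold_down is None else hold_down
    if not 0 <= interval <= MAX_HOLD_DOWN_SECONDS:
        raise ValueError("hold-down interval must be between 0 and 1800 seconds")
    if last_failover_at is None:  # the group has never failed over
        return True
    return now - last_failover_at >= interval
```

Under the defaults, a second failover of redundancy group 0 attempted 200 seconds after the first is held back, while the same attempt on redundancy group 1 goes ahead because its default interval is 1 second.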
  34. Chassis Cluster Redundancy Group Manual Failover
     We can initiate a redundancy group x failover manually. A manual failover applies until a failback event occurs. You can also initiate a redundancy group 0 failover manually if you want to change the primary node for redundancy group 0.
  35. State Transition Cases
     There are three transition cases:
     1. Reboot case: The node in the secondary-hold state transitions to the primary state; the other node goes dead (inactive).
  36. State Transition Cases (cont.)
     2. Control link failure case: The node in the secondary-hold state transitions to the ineligible state and then to a disabled state; the other node transitions to the primary state.
     3. Fabric link failure case: The node in the secondary-hold state transitions directly to the disabled state.
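The three transition cases can be written down as a lookup table. The Python sketch below only restates what the slides say about a node that starts in the secondary-hold state; `State`, `Event`, and the helper names are hypothetical.

```python
from enum import Enum
from typing import Dict, List


class State(Enum):
    HOLD = "hold"
    PRIMARY = "primary"
    SECONDARY_HOLD = "secondary-hold"
    SECONDARY = "secondary"
    INELIGIBLE = "ineligible"
    DISABLED = "disabled"


class Event(Enum):
    REBOOT = "reboot"
    CONTROL_LINK_FAILURE = "control link failure"
    FABRIC_LINK_FAILURE = "fabric link failure"


# States a node in secondary-hold passes through for each case, in order.
SECONDARY_HOLD_PATHS: Dict[Event, List[State]] = {
    Event.REBOOT: [State.PRIMARY],
    Event.CONTROL_LINK_FAILURE: [State.INELIGIBLE, State.DISABLED],
    Event.FABRIC_LINK_FAILURE: [State.DISABLED],
}


def path_from_secondary_hold(event: Event) -> List[State]:
    """Full sequence of states, starting at secondary-hold."""
    return [State.SECONDARY_HOLD] + SECONDARY_HOLD_PATHS[event]


def final_state(event: Event) -> State:
    """State in which the node ends up after the event."""
    return SECONDARY_HOLD_PATHS[event][-1]
```

Keeping the cases in a table makes the difference between the two link failures easy to see: both end in the disabled state, but only the control link case passes through ineligible first.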
  37. SNMP Failover Traps
     Chassis clustering supports SNMP traps, which are triggered whenever there is a redundancy group failover. The trap message can help you troubleshoot failovers. It contains the following information:
     1. The cluster ID and node ID
     2. The reason for the failover
     3. The redundancy group that is involved in the failover
     4. The redundancy group's previous state and current state
  38. Chassis Cluster Interfaces
     A network device doesn't help a network without participating in traffic processing. An SRX has two different interface types that it can use to process traffic:
     1. Reth interface
     2. Local interface
  39. Reth Interface
     A Reth is a Junos aggregate Ethernet interface with special properties compared to a traditional aggregate Ethernet interface. The Reth allows the administrator to add one or more child links per …
  40. Reth MAC Address
     The MAC address for the Reth is based on a combination of the cluster ID and the Reth number.
  41. Reth MAC Address (cont.)
     In the figure, the first four of the six bytes are fixed. They do not change between cluster deployments. The last two bytes vary based on the cluster ID and the Reth index.
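The figure is not reproduced in this transcript, so the exact byte layout is not available here. The Python sketch below shows how such an address could be derived under stated assumptions: a fixed prefix of `00:10:db:ff`, the cluster ID in the upper four bits of the fifth byte, and the Reth index in the sixth byte. All of these, and the function name `reth_mac`, are assumptions for illustration and should be checked against the Juniper documentation.

```python
# Assumed fixed prefix; the slides only say that the first four bytes are fixed.
FIXED_PREFIX = (0x00, 0x10, 0xDB, 0xFF)


def reth_mac(cluster_id: int, reth_index: int) -> str:
    """Derive a Reth MAC address from the cluster ID and Reth index.

    Assumed layout of the last two bytes (not confirmed by the slides):
    fifth byte = cluster ID in the upper four bits, lower four bits zero;
    sixth byte = Reth index.
    """
    if not 1 <= cluster_id <= 15:
        raise ValueError("cluster ID must be 1 through 15")
    if not 0 <= reth_index <= 255:
        raise ValueError("Reth index must fit in one byte")
    fifth = cluster_id << 4
    sixth = reth_index
    return ":".join(f"{byte:02x}" for byte in (*FIXED_PREFIX, fifth, sixth))
```

Under these assumptions, `reth_mac(1, 0)` gives `00:10:db:ff:10:00`. Because the cluster ID is part of the address, two clusters with the same ID in one Layer 2 domain would produce the same Reth MAC addresses, which fits the limit of 15 clusters per Layer 2 domain stated earlier.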
  42. Local Interface
     A local interface is an interface that is configured local to a specific node. It is configured the same way as an interface on a standalone device.
  43. Local Interface (cont.)
     The significance of a local interface in an SRX cluster is that it does not have a backup interface on the other chassis, meaning that it is part of neither a Reth nor a redundancy group. If this interface were to fail, its IP address would not fail over to the other node.
  44. Troubleshooting the Cluster
     There are several steps an administrator can take to troubleshoot a chassis cluster:
     1. Identifying the cluster status
     2. Checking interfaces
     3. Verifying the data plane
     4. Core dumps
     5. The dreaded priority zero
