SRX Getting Started - Troubleshoot High Availability (HA)

  [KB15911]


Summary:

This article addresses troubleshooting a SRX chassis cluster (SRX High Availability).  For chassis cluster configuration, refer to KB15650 - SRX Getting Started - Configure Chassis Cluster (High Availability).

For a step-by-step chassis cluster troubleshooting approach, refer to Resolution Guides and Articles - SRX - High Availability (Chassis Cluster).
 
Symptoms:

Troubleshoot SRX chassis cluster.

Solution:

When working with chassis cluster configurations, the most common SRX high availability issues are caused by basic configuration or architectural problems. This article therefore examines common clustering issues first, then the commands used to check the HA state, and finally the debugging facilities.

This article covers the following topics:

  • Establishing Chassis Cluster Tips
  • Failover Behavior Tips
  • Additional Troubleshooting


Establishing Chassis Cluster Tips

  1. Is chassis clustering enabled?  
    Check the output of the show chassis cluster status command to determine the status of the chassis cluster.

    If chassis clustering is not enabled, the following will be displayed: 
    root@SRX210> show chassis cluster status
    error: Chassis cluster is not enabled.
    If chassis clustering is enabled, the output will look something like the following:
    root@SRX5800-1> show chassis cluster status
    Cluster ID: 1
    Node name                  Priority     Status    Preempt  Manual failover
    
    Redundancy group: 0 , Failover count: 1
        node0                   1           primary   no       no
        node1                   1           secondary no       no
    
    Redundancy group: 1 , Failover count: 1
        node0                   254         primary   no       no
        node1                   1           secondary no       no
  2. Is there 'like' hardware in both nodes (chassis members)?
    A hardware mismatch could result in a coldsync failure or one of the nodes could be in the disabled state.
    In a chassis cluster environment, each node must have the same hardware. On the SRX1400, SRX3000, and SRX5000 lines, both nodes MUST have the same number of SPCs, and each SPC MUST reside in the same slot location on each node. Beyond that, there are some exceptions:
    • On the SRX5600 and SRX5800, it does not strictly matter which slots are used for the IOC cards (although matching slots are recommended for simplicity), as long as the same number and type of cards exist in both chassis nodes.
    • On the SRX1400, SRX3400, SRX3600, and SRX Branch products (SRX100, SRX210, SRX220, SRX240, SRX550, and SRX650), the same hardware is required in both cluster nodes, in the same slots.
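    One way to verify that the hardware matches is to compare the chassis inventory of the two nodes; in a formed cluster the command below lists the hardware of both nodes, and the optional match filter (shown as an example) narrows the output to lines mentioning the SPCs:
     root@SRX5800-1> show chassis hardware
     root@SRX5800-1> show chassis hardware | match SPC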

  3. Is the Junos version the same on both nodes?
    A software mismatch could result in the nodes not seeing each other ("split-brain") or other unpredictable behavior.
    Each node of an SRX chassis cluster must be running the same version of Junos.
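    As a quick check, run the following on each node and compare the reported Junos release:
     root@SRX5800-1> show version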

  4. Have both nodes in the chassis cluster been rebooted?
    To set up a chassis cluster, each node must be rebooted after clustering is enabled; until it is rebooted, the node will not become active in the cluster. Use the show chassis cluster status command to confirm that both nodes have joined the cluster. Also, if you RMA a device, or a switch fabric board on the SRX1400, SRX3000, or SRX5000, you will need to re-issue the chassis cluster enable command on the replacement unit, because the cluster setting is stored in NVRAM and not in the configuration itself; a replacement device therefore will not have this setting.
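    For reference, clustering is enabled from operational mode and takes effect only after the reboot; the cluster ID value below is illustrative:

    On the first device (node 0):
     root@SRX210> set chassis cluster cluster-id 1 node 0 reboot

    On the second device (node 1):
     root@SRX210> set chassis cluster cluster-id 1 node 1 reboot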

  5. Is the control link in the appropriate port on the SRX?
    For information on how ports are assigned on a chassis cluster, refer to KB15356 - How are interfaces assigned on J-Series and SRX platforms when the chassis cluster is enabled?
    • On the SRX3400 and SRX3600, the control port is fixed to HA port 0, which is on the SFB board.
    • Beginning with Junos 11.1, chassis clustering is supported on the SRX1400. Unlike the SRX3000, which has dedicated HA control ports, the SRX1400 uses ge-0/0/10 and ge-0/0/11 as the control links when the chassis cluster is enabled.
    • On the SRX5600 and SRX5800, you must configure which ports are used as control ports (see the configuration sketch after this list). You can use control port 0 on any SPC; however, we recommend using an SPC that is NOT the CP (central point). The control port SPC must be in the same slot location on each node.
    • The SRX5600 and SRX5800 support dual control links beginning with Junos 10.0; however, a second RE docked in CB slot 1 of each node is required. With dual control links you may use control port 1 on an SPC, but for maximum redundancy we recommend placing the two control links on two different SPCs.
    • The SRX3400 and SRX3600 support dual control links beginning with Junos 10.2; however, an SRX Clustering Module (SCM) must be installed in each node. A second RE cannot be used for this function.
    • On the SRX1400, ports 10 and 11 on the SYSIO can be configured as dual control links. When the device is not in cluster mode, these ports can be used as revenue ports.
    • On the SRX5400, SRX5600, and SRX5800 you must use a fiber SFP link. On the SRX1400, ge-0/0/10 can use a copper or fiber SFP link, but ge-0/0/11 requires a fiber SFP. On the SRX3400 and SRX3600, you can use a copper or fiber SFP link.
    • On J-Series and SRX Branch devices, the control link varies depending on the platform. For more information, refer to KB15356 - How are interfaces assigned on J-Series and SRX platforms when the chassis cluster is enabled?
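    As referenced in the list above, a minimal control port configuration for an SRX5600/SRX5800 cluster might look like the following (configuration mode); the FPC slot numbers are illustrative and must correspond to the SPC slots actually installed, with node1 FPC slots continuing the numbering after node0:
     set chassis cluster control-ports fpc 1 port 0
     set chassis cluster control-ports fpc 13 port 0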

  6. If using dual control links are you running the correct version, and do you have dual routing engines?
    On the SRX high-end devices, dual control links are supported beginning with Junos 10.0 on the SRX5600/SRX5800 and Junos 10.2 on the SRX3400/SRX3600 (see step 5 above). However, you must have two Routing Engines in each cluster member on the SRX5000 line, or an SCM module in each node on the SRX3000 line. The second RE on the SRX5000 is not used for backup routing today; it is used only to activate the control link port on the internal switch. The SRX3000 and SRX1400 do not support a second RE at this time.

  7. Is the data link properly configured on the SRX?
    Unlike ScreenOS, the SRX requires separate links to the control plane and the data plane; the data link is called the fabric link on the SRX. Any available data plane (revenue) port can be used as the fabric port, but you must configure manually which interface will serve that role (see the sketch below). In an Active/Passive deployment, 1 Gbps is more than enough for this function and 10 Gbps provides no advantage. In Active/Active, however, if traffic can arrive on an interface of one chassis member and cross the fabric link to exit an egress interface on the other member, a 10 Gbps link may be required to maximize throughput. The fabric link must be established for HA communication to be fully supported, since it is responsible for synchronizing real-time objects to the other member. Also, beginning with Junos 10.2, the SRX can support a second fabric link per node.
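    A minimal fabric link configuration might look like this (configuration mode); the member interface names are illustrative, fab0 is node0's fabric interface, fab1 is node1's, and the node1 member interface uses node1's FPC numbering:
     set interfaces fab0 fabric-options member-interfaces ge-0/0/2
     set interfaces fab1 fabric-options member-interfaces ge-8/0/2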

  8. If multiple SRX clusters exist on the same L2 broadcast domain, is the same cluster ID used?
    If you have multiple SRX clusters on the same L2 broadcast domain, you must use different cluster ID numbers, because the cluster ID is used to form the virtual MAC address of the RETH interfaces. If the same cluster ID is used, the virtual MAC addresses will overlap and forwarding problems will occur.
 

Failover Behavior Tips

  1. Have the appropriate redundancy groups been configured on the chassis with the appropriate priorities? 
    Redundancy group 0 must be configured for the control plane, and redundancy groups 1 and higher are used for the data plane; within a redundancy group, the node with the higher priority is preferred over the node with the lower priority. For proper failover to occur, make sure the appropriate priorities are configured on the appropriate node (see the sketch below). Item 2 of this section shows the show chassis cluster status output of a working chassis cluster, with appropriate (non-zero) priority values for full operation.
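    As referenced above, a minimal priority configuration for the control plane (RG0) and data plane (RG1) might look like this (configuration mode); the priority values are illustrative, and the higher value is preferred:
     set chassis cluster redundancy-group 0 node 0 priority 200
     set chassis cluster redundancy-group 0 node 1 priority 100
     set chassis cluster redundancy-group 1 node 0 priority 200
     set chassis cluster redundancy-group 1 node 1 priority 100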

  2. Has the SRX had enough time to boot and complete the cold sync process? 
    Keep in mind that it takes about 5* minutes for a chassis to boot up and about 5* minutes to complete cold sync, so allow enough time for this process to complete before performing a failover.  Check the output of the show chassis cluster status command to ensure that no redundancy group has a priority of 0, as shown below.

    root@SRX3400-1> show chassis cluster status
    Cluster ID: 1
     Node name                  Priority     Status    Preempt  Manual failover

     Redundancy group: 0 , Failover count: 1
         node0                   200         primary   no       no
         node1                   100         secondary no       no

     Redundancy group: 1 , Failover count: 1
         node0                   200         primary   no       no
         node1                   100         secondary no       no


  3. Is preempt configured for redundancy groups?
    Preempt can be enabled on a per redundancy group basis. When preempt is enabled, the node with the higher priority takes over the redundancy group even if the lower priority node is currently active. If preempt is not enabled, a higher priority node that comes back online after having been disabled will not seize control of the redundancy group from the currently active lower priority node (see the example below).
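    For example, preempt can be enabled with a single configuration statement (redundancy group 1 is illustrative):
     set chassis cluster redundancy-group 1 preempt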

  4. Is the redundant Ethernet configuration configured within the proper redundancy group, and is that group configured with the correct priority on the correct node?
    When using redundant Ethernet, make sure that the redundant Ethernet interface is assigned to the correct redundancy group, which must be RG1 or higher, and that the redundancy group has the appropriate priority values to be active on the correct node (see the sketch below). When using Active/Passive, the control plane is RG0 and the data plane is RG1. In Active/Active, there can be multiple redundancy groups, such as RG1, RG2, RG3, and so on, which can be active on different members as controlled by the per-node priority settings.
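    A minimal sketch of a redundant Ethernet interface bound to redundancy group 1 follows (configuration mode); the reth count, member interface names, and address are illustrative:
     set chassis cluster reth-count 2
     set interfaces ge-0/0/2 gigether-options redundant-parent reth0
     set interfaces ge-8/0/2 gigether-options redundant-parent reth0
     set interfaces reth0 redundant-ether-options redundancy-group 1
     set interfaces reth0 unit 0 family inet address 10.1.1.1/24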

  5. After a control link failure, was a reboot performed on the disabled node to reactivate it in the cluster?
    Unless control-link-recovery is enabled and the disabled state was caused by a control link failure, you will need to manually reboot the disabled node for it to become the secondary node in the cluster. If you do not reboot the disabled member, it will remain in the disabled state.
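    If you want the disabled node to rejoin automatically (by rebooting itself) once the control link is healthy again, the control-link-recovery option can be enabled in the configuration:
     set chassis cluster control-link-recovery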

  6. After a fabric link failure, was a reboot performed on the disabled node to reactivate it in the cluster?
    If a fabric link failure occurs, a reboot must be performed on the disabled member for it to become active again; the control-link-recovery option does not apply to loss of the fabric link. If you do not reboot it, the disabled member will not be able to become active. There is currently no command to automatically reboot the node when a fabric link failure occurs.

  7. After a manual failover of a redundancy group, was the manual failover flag cleared?
    When you use the command request chassis cluster failover redundancy-group <redundancy-group> node <new-primary-node> to fail over a redundancy group, you must clear the manual failover flag with the command request chassis cluster failover reset redundancy-group <redundancy-group> before another failover can be performed (see the example below).
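    For example, to fail redundancy group 1 over to node 1 and then clear the manual failover flag afterward (the redundancy group and node numbers are illustrative):
     root@SRX5800-1> request chassis cluster failover redundancy-group 1 node 1
     root@SRX5800-1> request chassis cluster failover reset redundancy-group 1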

  8. Is the feature you are trying to use supported in High Availability clusters?
    Many features are supported only in standalone mode and not in HA. Be aware that some features may not be available in HA, so check the following references: SRX Feature Support Reference and KB14371 - [Archive] What features are not supported in Chassis Cluster on high-end SRX Series security gateway? Contact JTAC or your Juniper Sales Representative if you are still unclear or run into an issue.

     

  9. For the SRX1400, SRX3400, SRX3600, SRX5600, and SRX5800 platforms, does a loss of a fabric plane trigger a failover?
    A loss of a fabric plane does not trigger a failover in the current design; it does, however, generate a chassis alarm. This behavior is subject to change in a future release, so check the release notes of upcoming Junos releases for changes.

     

  10. Technical Documentation Reference:  Understanding Chassis Cluster Redundancy Group Failover

    Note: The boot-up and cold-sync times mentioned above depend on the platform and on the type of configuration. For example, on an SRX5800 the cold-sync time depends on the type of configuration and on the number of SPC cards present in the box.


Additional Troubleshooting

There are additional basic troubleshooting steps that you can perform:
  1. Check the output of show chassis cluster status, which displays the current status of the chassis cluster:

    root@SRX5800-1> show chassis cluster status
    Cluster ID: 1
     Node name                  Priority     Status    Preempt  Manual failover

     Redundancy group: 0 , Failover count: 1
         node0                   1           primary   no       no
         node1                   1           secondary no       no

     Redundancy group: 1 , Failover count: 1
         node0                   254         primary   no       no
         node1                   1           secondary no       no
    
    
  2. Check the status of the participating physical and logical interfaces.
    You can do this by using the command show interfaces <interface> terse as well as show chassis cluster interfaces:
    root@SRX5800-1> show interfaces terse
    Interface               Admin Link Proto    Local                 Remote
    gr-0/0/0                up    down
    ip-0/0/0                up    down
    mt-0/0/0                up    down
    pd-0/0/0                up    down
    pe-0/0/0                up    down
    ge-11/0/0               up    up
    ge-11/0/0.0             up    up   inet     200.200.200.1/24
                                       multiservice
    ge-11/0/1               up    up
    
    root@SRX5800-1> show chassis cluster interfaces
    Control link name: em0
    
    Redundant-ethernet Information:
        Name         Status      Redundancy-group
        reth0        Down        1
        reth1        Down        1
        reth2        Down        1
    
  3. Check the control plane statistics to verify that heartbeats and probes are being sent and received over both the control and fabric links, using the command show chassis cluster control-plane statistics:
    root@SRX5800-1> show chassis cluster control-plane statistics
    Control link statistics:
        Heartbeat packets sent: 692386
        Heartbeat packets received: 692352
    Fabric link statistics:
        Probes sent: 692381
        Probes received: 692100
     
  4. Check the logs on both nodes.  The following outputs typically help you identify HA issues:
     FOR BOTH NODES:

    show log jsrpd
    show log messages
    show log chassisd   (will report hardware chassis failures)
    show log dcd

    show chassis cluster status
    show chassis cluster statistics
    show chassis cluster information