Resolution Guide - SRX - Troubleshooting steps when the Chassis Cluster does not come up

  [KB20641]


Summary:

This article contains step-by-step troubleshooting procedures for when a Chassis Cluster does not come up or a node in the cluster is in a Hold or Disabled state.  This article is part of the Resolution Guides and Articles - SRX - High Availability (Chassis Cluster).


Symptoms:

  • Chassis Cluster is not coming up
  • Chassis Cluster is not in Primary/Secondary State
  • Both HA (High Availability) members are in Primary State
  • One HA member is in Hold, Disabled, or Secondary-Hold State
If your Chassis Cluster is up and running and you simply want to verify that it is in a healthy state, refer to KB15439 - Verify Chassis Cluster is in healthy state or Verifying the Chassis Cluster Configuration.

Cause:

Solution:

Perform the following steps to troubleshoot your Chassis Cluster.

step1  Are you configuring the Chassis Cluster for the first time?

  • Yes - Continue with Step 2
  • No   - Proceed to Step 3 to begin troubleshooting
    (Selecting No means that the Chassis Cluster was previously up and then went down for reasons that need to be investigated.)

step2  As a new configuration, check the following to make sure basic configuration guidelines are being followed:

  1. Confirm the Hardware and Software requirements for your Chassis Cluster. Refer to the following articles to make sure the basic software and hardware requirements are satisfied in your Chassis Cluster scenario:
    KB16141 - Minimum hardware and software requirements for a Chassis Cluster
    KB15425 - Are licenses needed for each node of a Chassis Cluster?

    Note:  Chassis Clustering is supported on the SRX 1400 and SRX 220 beginning with the Junos OS 11.x family of releases. Contact your technical support representative if you need further clarification.

  2. Confirm that the features running on your SRX device are supported with Chassis Clustering.  Refer to the Feature Support Reference guide for your specific Junos OS version.

    If you are running unsupported features for a Chassis Cluster, deactivate or remove them before proceeding to create a Chassis Cluster.

  3. Make sure that the cabling is correct, and that the Control and Fabric links are up. Connecting the Control and Fabric links directly between the two nodes is recommended.
    //sample output showing the control and fabric links as up
    
    {primary:node0}
    root@J-SRX> show interfaces terse | match fxp 
    fxp0                    up    up  
    fxp0.0                  up    up   inet     10.2.2.1/24     
    fxp1                    up    up  
    fxp1.0                  up    up   inet     129.16.0.1/2    
    fxp2                    up    up  
    fxp2.0                  up    up   tnp      0x1100001
        
    root@J-SRX> show interfaces terse | match fab 
    ge-0/0/2.0              up    up   aenet    --> fab0.0
    ge-9/0/2.0              up    up   aenet    --> fab1.0
    fab0                    up    up  
    fab0.0                  up    up   inet     30.17.0.200/24  
    fab1                    up    up  
    fab1.0                  up    up   inet     30.18.0.200/24  

    Note: The Control and Fabric links differ with the hardware platforms. Make sure that the correct ports are used for connecting the Control and Fabric links.


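    If you find that the Control or Fabric links are showing down, refer to the following articles to troubleshoot this issue further:
    KB20687 - How to troubleshoot a Fabric Link that is down on a Chassis Cluster
    KB20698 - How to troubleshoot a Control Link that is down on a Chassis Cluster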

  4. Confirm the Chassis Cluster configuration.  Refer to KB15439 - How do I verify chassis cluster nodes are configured and up on J-Series and SRX.

  5. Reboot both Chassis Cluster nodes simultaneously. This should ensure a clean cluster state. If the issue is still not resolved, proceed to Step 3.
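The link check in item 3 can be scripted when the output has been saved to a file or captured over a session. The following is a minimal sketch only: the function name links_not_up and the WATCHED tuple are hypothetical, and it assumes that fxp1 is the control interface and fab0/fab1 are the fabric interfaces, as in the sample output. Other platforms may use different port names.

```python
# Trimmed copy of the "show interfaces terse" sample output from item 3.
SAMPLE_TERSE = """\
fxp1                    up    up
fxp1.0                  up    up   inet     129.16.0.1/2
ge-0/0/2.0              up    up   aenet    --> fab0.0
ge-9/0/2.0              up    up   aenet    --> fab1.0
fab0                    up    up
fab0.0                  up    up   inet     30.17.0.200/24
fab1                    up    up
fab1.0                  up    up   inet     30.18.0.200/24
"""

# Assumption: these interface names match the sample; adjust per platform.
WATCHED = ("fxp1", "fab0", "fab1")


def links_not_up(terse_output, watched=WATCHED):
    """Return the watched interfaces that are missing or not up/up."""
    states = {}
    for line in terse_output.splitlines():
        fields = line.split()
        # A terse line reads: <interface> <admin> <link> [<proto> <address>]
        if len(fields) >= 3 and {fields[1], fields[2]} <= {"up", "down"}:
            states[fields[0]] = (fields[1], fields[2])
    return [name for name in watched if states.get(name) != ("up", "up")]
```

An empty list means every watched interface is admin up and link up. Any name returned is a candidate for the Control or Fabric link troubleshooting articles.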


step3  Run the command 'show chassis cluster status' to check the current status of the Chassis Cluster:

{primary:node0}
root@J-SRX> show chassis cluster status 
Cluster ID: 1 
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                   100         secondary      no       no  
    node1                   1           primary        no       no  

Redundancy group: 1 , Failover count: 1
    node0                   100         secondary      no       no  
    node1                   1           primary        no       no  
  1. Do you see a Cluster ID in the Chassis Cluster output (the 'Cluster ID: 1' line in the sample above)?

  2. Do you see both node0 and node1 in the output (listed under each redundancy group in the sample above)?

    • Yes - Proceed with Step 6.
    • No   - Proceed with Step 4.
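The two questions in this step can also be answered programmatically from captured output. The sketch below is illustrative only: parse_cluster_status and both_nodes_present are hypothetical helper names, and the regular expressions assume the output layout shown in the sample above, which may differ between Junos OS releases.

```python
import re

# Copy of the "show chassis cluster status" sample output from this step.
SAMPLE_STATUS = """\
Cluster ID: 1
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                   100         secondary      no       no
    node1                   1           primary        no       no

Redundancy group: 1 , Failover count: 1
    node0                   100         secondary      no       no
    node1                   1           primary        no       no
"""


def parse_cluster_status(text):
    """Return (cluster_id, {group_number: {node_name: status}})."""
    cluster_id = None
    groups = {}
    current = None
    for line in text.splitlines():
        match = re.match(r"\s*Cluster ID:\s*(\d+)", line)
        if match:
            cluster_id = int(match.group(1))
            continue
        match = re.match(r"\s*Redundancy group:\s*(\d+)", line)
        if match:
            current = int(match.group(1))
            groups[current] = {}
            continue
        # Node rows read: <node> <priority> <status> <preempt> <manual failover>
        match = re.match(r"\s*(node[01])\s+(\d+)\s+(\S+)", line)
        if match and current is not None:
            groups[current][match.group(1)] = match.group(3).lower()
    return cluster_id, groups


def both_nodes_present(groups):
    """True only if every redundancy group lists both node0 and node1."""
    return bool(groups) and all(
        {"node0", "node1"} <= set(nodes) for nodes in groups.values()
    )
```

If parse_cluster_status returns a cluster ID and both_nodes_present is true, the answer to both questions is Yes. Otherwise, continue with Step 4.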


step4  If you do not see both nodes in the cluster status output (as shown in Step 3), the hardware or software components may differ between the two nodes. Are the components the same on both nodes?

        Make sure that the hardware components are the same on both devices, the software versions are the same, and the interfaces used as part of each reth are logically the same.


step5  Is the Cluster ID the same on both nodes?  In order to check the Cluster ID value, connect a console to both nodes. Run the command 'show chassis cluster status'.

  • No   - Set the Cluster ID on both nodes to the same value. This is a requirement.  See note below. Then reboot both devices simultaneously. If this does not resolve the issue, go to Step 7.

    Note: If you have more than one Chassis Cluster on the same switch or L2 domain, then each pair of Chassis Cluster nodes must have a different Cluster ID. For example, suppose two pairs of Chassis Cluster nodes are connected to the same switch: Juniper_mktg (node0 and node1) is one pair, and Juniper_eng (node0 and node1) is another pair. Juniper_mktg node0 and node1 may be assigned the Cluster ID of 1. Juniper_eng node0 and node1 may then be assigned the Cluster ID of 2; Juniper_eng should not be assigned a Cluster ID of 1 because the other pair is using 1. This is because the reth MAC addresses are calculated from the Cluster ID, so two clusters with the same Cluster ID in the same network can cause a network impact due to overlapping virtual MAC entries.

  • Yes - Run the command 'show chassis cluster interfaces'.  Are the Control link status and the Fabric link status both Up (see the 'Control link status' and 'Fabric link status' lines in the sample output below)?
    {primary:node0}
    root@J-SRX> show chassis cluster interfaces
    Control link 0 name: fxp1
    Control link status: Up

    Fabric interfaces:
    Name    Child-interface    Status
    fab0    ge-0/0/2           up
    fab0
    fab1    ge-9/0/2           up
    fab1
    Fabric link status: Up
    Redundant-ethernet Information:     
        Name         Status      Redundancy-group
        reth0        Down        1                
        reth1        Down        1                
        reth2        Down        Not configured                
        reth3        Down        Not configured                              
    
    Interface Monitoring:
        Interface         Weight    Status    Redundancy-group
        ge-2/0/1          255       Down      1  
        ge-11/0/1         255       Up        1   
        ge-11/0/0         255       Down      1   
        ge-2/0/0          255       Down      1   
    
    If the Control and Fabric Link status is not Up, refer to the following articles to troubleshoot this issue:
    KB20687 - How to troubleshoot a Fabric Link that is down on a Chassis Cluster
    KB20698 - How to troubleshoot a Control Link that is down on a Chassis Cluster
    If the Control and Fabric links both show as Up, return to Step 1 and work through the steps again to narrow down the issue. If the issue persists, go to Step 7 to open a case with JTAC.
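The Yes branch of this step comes down to reading two status lines and choosing the matching article. The sketch below shows one way to do that from captured output. The helper names link_status and articles_for are hypothetical, the parsing assumes the 'Control link status' and 'Fabric link status' lines appear as in the sample output, and the KB numbers are the ones cited in this step.

```python
import re

# Trimmed copy of the "show chassis cluster interfaces" sample output.
SAMPLE_INTERFACES = """\
Control link 0 name: fxp1
Control link status: Up

Fabric interfaces:
Name Child-interface Status
fab0 ge-0/0/2 up
fab1 ge-9/0/2 up
Fabric link status: Up
"""


def link_status(text):
    """Return {'control': status, 'fabric': status}; None if a line is absent."""
    result = {"control": None, "fabric": None}
    for line in text.splitlines():
        match = re.match(r"\s*(Control|Fabric) link status:\s*(\S+)", line)
        if match:
            result[match.group(1).lower()] = match.group(2).lower()
    return result


def articles_for(status):
    """List the KB articles to consult; a missing status is treated as not up."""
    articles = []
    if status["fabric"] != "up":
        articles.append("KB20687")  # Fabric Link down
    if status["control"] != "up":
        articles.append("KB20698")  # Control Link down
    return articles
```

An empty list from articles_for means both links report Up, which corresponds to returning to Step 1 or escalating through Step 7.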

step6  What is the current state of the Chassis Cluster?  Proceed to the next troubleshooting steps based on the state that you see for node 0 and node 1 respectively.

  1. Primary/Secondary -> This is the expected state for a healthy cluster. Proceed to KB20673 - How to verify that Chassis Cluster in Primary/Secondary State has proper priority. This is the final check for a healthy cluster state.
  2. Primary/Lost -> Proceed to KB20672 - Troubleshooting steps if the Chassis Cluster is in Primary/Lost State.

  3. Primary/Hold -> A possible cause is that the JSRP daemon is stuck on one of the nodes. Either reboot both devices simultaneously, or open a case with your technical support representative. Consult KB21781 - [SRX] Data Collection Checklist - Logs/data to collect for troubleshooting.
  4. Hold/Lost -> Refer to KB27713 - How to recover or prevent a chassis cluster from going into a Hold/Lost state.

  5. Primary/Disabled -> Proceed to KB20697 - Troubleshooting steps if the Chassis Cluster is in Primary/Disabled State.
  6. Primary or Secondary in Hold state -> This could be temporary behavior. Check the output of the command 'show chassis fpc pic-status'. The available PICs on both nodes should show as Online. Allow some time for the PICs to come online on both nodes; the status should then change to Primary/Secondary. If the situation does not improve, proceed to Step 7.
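The state-to-action list in this step can be expressed as a small lookup table. The following is a sketch under stated assumptions, not an official decision procedure: the names NEXT_STEP, HOLD_ADVICE, ESCALATE, and next_step are hypothetical, the two node states are treated as an unordered pair, and any pair containing a hold state other than Primary/Hold or Hold/Lost is mapped to item 6. The KB numbers are the ones listed in this step.

```python
# Lookup keyed by the unordered pair of node states (lower case).
NEXT_STEP = {
    frozenset(["primary", "secondary"]): "KB20673 - verify priority (healthy cluster)",
    frozenset(["primary", "lost"]): "KB20672 - Primary/Lost troubleshooting",
    frozenset(["primary", "hold"]): "Reboot both nodes simultaneously or open a case (KB21781)",
    frozenset(["hold", "lost"]): "KB27713 - recover from Hold/Lost",
    frozenset(["primary", "disabled"]): "KB20697 - Primary/Disabled troubleshooting",
}

HOLD_ADVICE = "Check PIC status on both nodes and wait; if no change, go to Step 7"
ESCALATE = "Go to Step 7"


def next_step(state_a, state_b):
    """Map the two node states to the follow-up action listed in Step 6."""
    pair = frozenset([state_a.lower(), state_b.lower()])
    if pair in NEXT_STEP:
        return NEXT_STEP[pair]
    # Assumption: remaining hold variants (for example secondary-hold) fall under item 6.
    if any(state == "hold" or state.endswith("-hold") for state in pair):
        return HOLD_ADVICE
    # States not covered by the list (for example both nodes primary) are escalated.
    return ESCALATE
```

For example, next_step("secondary", "primary") resolves to the KB20673 entry, while a pair that the list does not cover falls through to Step 7.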


step7  If you want to determine the cause of a failover, refer to KB21164 - [SRX] Finding out possible reasons for Chassis Cluster failover.

If the above steps do not resolve this problem, refer to KB21781 - [SRX] Data Collection Checklist - Logs/data to collect for troubleshooting to collect the necessary logs from both devices, and open a case with your technical support representative.

