[SRX] Nodes of a cluster go into Primary/Lost state after replacement of the RE or entire system

Article ID: KB23929 KB Last Updated: 04 Mar 2017 Version: 5.0
Summary:
This article applies to SRX Series branch devices (SRX100, SRX210, SRX220, SRX240, SRX550, and SRX650) running release 10.4 or later. In these releases, the virtual LAN (VLAN) tag that was previously used for control-link traffic is replaced by the experimental EtherType 0x88b5.

However, backward compatibility is supported for devices that have already been deployed in a chassis cluster with VLAN tagging.
Symptoms:
This KB applies if you have a chassis cluster up and running on a release prior to 10.4 and have performed an RMA of an RE or of an entire node of the cluster.

Replacement devices ship with a release later than 10.4. To form a cluster, you must therefore either upgrade both nodes to the same later release or downgrade the replacement to the release of the existing cluster. After the upgrade or downgrade, the cluster shows a split-brain condition:

{primary:node0}
root@node0> show chassis cluster status 
Cluster ID: 1 
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                   200         primary        no       no  
    node1                   0           lost           n/a      n/a 

Redundancy group: 1 , Failover count: 1
    node0                   200         primary        no       no  
    node1                   0           lost           n/a      n/a 

{primary:node1}
root@node1> show chassis cluster status 
Cluster ID: 1 
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                   0           lost           n/a      n/a 
    node1                   100         primary        no       no  

Redundancy group: 1 , Failover count: 1
    node0                   0           lost           n/a      n/a 
    node1                   0           primary        no       no  
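
The split-brain pattern in the output above (each node reports itself as primary and its peer as lost) can be checked mechanically. The following is a minimal Python sketch, assuming the text of show chassis cluster status has already been collected from each node. The function names node_states and is_split_brain are hypothetical and are not part of any Juniper tool.

```python
import re

# Matches per-node rows such as "    node0    200    primary    no    no".
ROW = re.compile(r"^\s*(node\d+)\s+(\d+)\s+(\S+)", re.MULTILINE)


def node_states(status_output):
    """Return {node: set of statuses} across all redundancy groups."""
    states = {}
    for node, _priority, status in ROW.findall(status_output):
        states.setdefault(node, set()).add(status)
    return states


def is_split_brain(output_node0, output_node1):
    """True when each node sees itself as primary and its peer as lost."""
    view0 = node_states(output_node0)
    view1 = node_states(output_node1)
    return (
        view0.get("node0") == {"primary"}
        and view0.get("node1") == {"lost"}
        and view1.get("node1") == {"primary"}
        and view1.get("node0") == {"lost"}
    )
```

The check is deliberately strict: it reports a split brain only when every redundancy group on both nodes shows the primary/lost pattern.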
 
Cause:
In releases prior to 10.4, control port tagging was enabled by default and used VLAN 4094. In 10.4 and later releases, it is disabled by default.

As a result, after the upgrade or downgrade, the control port on one node is tagged and the control port on the other node is untagged. Control packets are therefore dropped, which in turn causes the split-brain condition.
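
To illustrate why a tagging mismatch leads to dropped control packets, the following Python sketch models the two frame layouts at the Ethernet header level. It is a simplified illustration, not the actual Junos implementation. VLAN 4094 and EtherType 0x88b5 come from this article and 0x8100 is the standard 802.1Q tag identifier. The accepts function is a hypothetical receive check, and the inner EtherType placed after the tag in the usage example is an arbitrary placeholder.

```python
import struct

TPID_8021Q = 0x8100         # standard 802.1Q tag protocol identifier
CONTROL_VLAN = 4094         # VLAN used for control links before 10.4 (from the article)
CONTROL_ETHERTYPE = 0x88B5  # experimental EtherType used from 10.4 on (from the article)


def describe_frame(frame: bytes) -> str:
    """Classify a frame by the 16-bit field after the two 6-byte MAC addresses."""
    (type_field,) = struct.unpack_from("!H", frame, 12)
    if type_field == TPID_8021Q:
        # The VLAN ID is the low 12 bits of the tag control information.
        (tci,) = struct.unpack_from("!H", frame, 14)
        return f"tagged vlan={tci & 0x0FFF}"
    return f"untagged ethertype=0x{type_field:04x}"


def accepts(receiver_tagging_enabled: bool, frame: bytes) -> bool:
    """Hypothetical receive check: the frame layout must match the local mode."""
    kind = describe_frame(frame)
    if receiver_tagging_enabled:
        return kind == f"tagged vlan={CONTROL_VLAN}"
    return kind == f"untagged ethertype=0x{CONTROL_ETHERTYPE:04x}"
```

For example, a frame built as bytes(12) + struct.pack("!HHH", 0x8100, 4094, 0x0800) is classified as tagged, so a receiver modeled with tagging disabled rejects it, and an untagged 0x88b5 frame is rejected by a receiver modeled with tagging enabled.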


You can check whether control port tagging is enabled or disabled on each node by running the show chassis cluster information detail command via the CLI. The truncated output of the command is as follows:
{primary:node0}
root@node0> show chassis cluster information detail 
node0:
--------------------------------------------------------------------------
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 259
        Heartbeat packets received: 0
        Heartbeat packet errors: 0
        Duplicate heartbeat packets received: 0
    Control recovery packet count: 0
    Sequence number of last heartbeat packet sent: 254
    Sequence number of last heartbeat packet received: 0
Fabric link statistics:
    Probes sent: 254
    Probes received: 0
    Probe errors: 0
    Probes not processed: 0
    Probes dropped due to control link down: 0
    Probes dropped due to fabric link down: 0
    Sequence number of last probe sent: 254
    Sequence number of last probe received: 0
Chassis cluster LED information:
    Current LED color: Red
    Last LED change reason: Peer node: node1 is not present
Control port tagging:
    Disabled

{primary:node1}
root@node1> show chassis cluster information detail 
node1:
--------------------------------------------------------------------------
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 779
        Heartbeat packets received: 0
        Heartbeat packet errors: 0
        Duplicate heartbeat packets received: 0
    Control recovery packet count: 0
    Sequence number of last heartbeat packet sent: 1175
    Sequence number of last heartbeat packet received: 0
Fabric link statistics:
    Probes sent: 779
    Probes received: 0
    Probe errors: 0
    Probes not processed: 0
    Probes dropped due to control link down: 284
    Probes dropped due to fabric link down: 0
    Sequence number of last probe sent: 1175
    Sequence number of last probe received: 0
Chassis cluster LED information:
    Current LED color: Red
    Last LED change reason: Peer node: node0 is not present
Control port tagging:
    Enabled

The above output can occur in the following scenarios:

  • A user with an existing chassis cluster; Node 1 runs a release prior to 10.4, and the RMA Node 0 was downgraded to the same release.

  • A user with an existing chassis cluster; Node 1 runs a release prior to 10.4, and the user upgrades Node 1 to the same release as the RMA Node 0.
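
The comparison of the Control port tagging field between the two nodes can also be scripted. The following is a minimal Python sketch, assuming the output of show chassis cluster information detail has been captured as text from each node. The names tagging_state and tagging_mismatch are hypothetical helpers.

```python
import re

# The state may appear on the same line as the field name or on the next line;
# \s* covers both cases because it also matches the line break and indentation.
TAGGING = re.compile(r"Control port tagging:\s*(Enabled|Disabled)")


def tagging_state(detail_output):
    """Return 'Enabled' or 'Disabled', or None if the field is absent."""
    match = TAGGING.search(detail_output)
    return match.group(1) if match else None


def tagging_mismatch(detail_node0, detail_node1):
    """True when the two nodes report different control port tagging states."""
    state0 = tagging_state(detail_node0)
    state1 = tagging_state(detail_node1)
    if state0 is None or state1 is None:
        raise ValueError("Control port tagging field not found")
    return state0 != state1
```

A result of True from tagging_mismatch corresponds to the condition described in the Cause section, where one control port is tagged and the other is untagged.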
Solution:
To avoid the split-brain condition, set both sides of the control link to the same mode, either tagged or untagged, by using the following operational mode command (specify either enable or disable):
root> set chassis cluster control-link-vlan enable/disable
The following command output is an example of either enabling control-link-vlan on Node 0 or disabling it on Node 1:
{primary:node0}
root@node0> set chassis cluster control-link-vlan enable
warning: A reboot is required for control-link-vlan to be enabled

{primary:node0}
root@node0> request system reboot
Reboot the system ? [yes,no] (no) yes
or
{primary:node1}
root@node1> set chassis cluster control-link-vlan disable
warning: A reboot is required for control-link-vlan to be disabled

{primary:node1}
root@node1> request system reboot
Reboot the system ? [yes,no] (no) yes

As the warnings in the command output indicate, a reboot is required after the change. After the reboot, the nodes form a cluster with the correct priorities. The following output was taken after control port tagging was set to the same state on both nodes (disabled, in this example):
{primary:node0}
root@node0> show chassis cluster status
Cluster ID: 1
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                   200         primary        no       no
    node1                   100         secondary      no       no

Redundancy group: 1 , Failover count: 1
    node0                   200         primary        no       no
    node1                   100         secondary      no       no

{primary:node0}
root@node0> show chassis cluster information detail
node0:
-------------------------------------------------
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 623
        Heartbeat packets received: 600
        Heartbeat packet errors: 0
        Duplicate heartbeat packets received: 0
    Control recovery packet count: 0
    Sequence number of last heartbeat packet sent: 617
    Sequence number of last heartbeat packet received: 637
Fabric link statistics:
    Probes sent: 617
    Probes received: 522
    Probe errors: 0
    Probes not processed: 254
    Probes dropped due to control link down: 0
    Probes dropped due to fabric link down: 3
    Sequence number of last probe sent: 617
    Sequence number of last probe received: 637
Chassis cluster LED information:
    Current LED color: Green
    Last LED change reason: No failures
Control port tagging:
    Disabled

node1:
-------------------------------------------
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 639
        Heartbeat packets received: 603
        Heartbeat packet errors: 0
        Duplicate heartbeat packets received: 0
    Control recovery packet count: 0
    Sequence number of last heartbeat packet sent: 637
    Sequence number of last heartbeat packet received: 617
Fabric link statistics:
    Probes sent: 637
    Probes received: 520
    Probe errors: 0
    Probes not processed: 253
    Probes dropped due to control link down: 0
    Probes dropped due to fabric link down: 3
    Sequence number of last probe sent: 637
    Sequence number of last probe received: 617
Chassis cluster LED information:
    Current LED color: Green
    Last LED change reason: No failures
Control port tagging:
    Disabled
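
One way to confirm recovery after the reboot is to verify that each node now receives heartbeats on the control link, in contrast to the received count of 0 seen during the split brain. The following is a minimal Python sketch, assuming the combined output of show chassis cluster information detail is available as text. The helper names heartbeats_received and control_link_healthy are hypothetical, and a received count greater than zero is used here only as a rough health indicator.

```python
import re


def heartbeats_received(detail_output):
    """Map each node name to its 'Heartbeat packets received' counter."""
    counters = {}
    node = None
    for line in detail_output.splitlines():
        stripped = line.strip()
        header = re.fullmatch(r"(node\d+):", stripped)
        if header:
            node = header.group(1)
            continue
        # fullmatch keeps "Duplicate heartbeat packets received" from matching.
        field = re.fullmatch(r"Heartbeat packets received: (\d+)", stripped)
        if field and node is not None:
            counters[node] = int(field.group(1))
    return counters


def control_link_healthy(detail_output):
    """True when at least one node is listed and all have received heartbeats."""
    counters = heartbeats_received(detail_output)
    return bool(counters) and all(count > 0 for count in counters.values())
```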
