[SRX] Node0 becomes RG0 primary after reboot, while node1 is already the primary

Article ID: KB30707   KB Last Updated: 20 Jul 2016   Version: 2.0
Summary:

Normally, node0 should not become the RG0 primary after a reboot. Preemption can be configured for redundancy groups 1 and higher, but it does not (and should not) take effect for RG0. This article describes a scenario in which node0 can become the RG0 primary after it is rebooted in a chassis cluster.
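For context, preemption is enabled per redundancy group under the chassis cluster configuration. A minimal sketch, with illustrative priorities that are not taken from this article:

set chassis cluster redundancy-group 1 node 0 priority 200
set chassis cluster redundancy-group 1 node 1 priority 100
set chassis cluster redundancy-group 1 preempt

As noted above, the preempt knob is intended for redundancy group 1 and higher; RG0 changes primary only on a failure condition or a manual failover.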

Symptoms:

Progress of the issue

1. Node0 is in the primary state before the reboot.

root> show chassis cluster status 
<-snip->
Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 1
node0  200      primary        no      no       None
node1  100      secondary      no      no       None           

Redundancy group: 1 , Failover count: 1
node0  200      primary        no      no       None           
node1  100      secondary      no      no       None           

2. Node0 is rebooted.
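For reference, the reboot in step 2 can be performed from the CLI of node0 itself; the exact method used in this case is not stated in the article, but a typical command is:

root> request system reboot

In a chassis cluster, this reboots only the node on which the command is issued.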

3. Node0 should come up in the secondary state after the reboot, but it has unexpectedly booted into the primary state.

root> show chassis cluster status 
<-snip->
Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 1
node0  200      primary        no      no       None           <<<<< node0 is primary
node1  100      secondary      no      no       None           

Redundancy group: 1 , Failover count: 1
node0  200      primary        no      no       None           
node1  100      secondary      no      no       None           

4. An RG0 failover occurred during the boot stage. The failover reason recorded on node0 is "Control & Fabric links down". In the example below, node0 was rebooted at 14:03:32, and the unexpected failover happened at 14:13:05.

root> show chassis cluster information
node0:
--------------------------------------------------------------------------
Redundancy Group Information:

    Redundancy Group 0 , Current State: primary, Weight: 255

	Time            From           To             Reason
        May 27 14:12:17 hold           secondary      Hold timer expired
        May 27 14:13:05 secondary      primary        Control & Fabric links down
<-snip->

node1:
--------------------------------------------------------------------------
Redundancy Group Information:

    Redundancy Group 0 , Current State: secondary-hold, Weight: 255

	Time            From           To             Reason
        May 27 13:52:09 hold           secondary      Hold timer expired
        May 27 14:03:32 secondary      primary        Control & Fabric links down    <<< node0 was rebooted
        May 27 14:13:05 primary        secondary-hold Preempt/yield(100/200)
<-snip->
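
When "Control & Fabric links down" is reported as the failover reason, the current state of those links can be checked directly on either node. A quick check (shown without output, as it was not captured in this case):

root> show chassis cluster interfaces

This lists the control and fabric interfaces together with their status on each node.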
Cause:

Node0 becomes primary after the reboot because a split-brain condition (both nodes acting as RG0 primary at the same time) occurred during the boot stage.
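
The timelines below were taken from the jsrpd (chassis cluster) daemon log on each node. Assuming default logging, similar entries can typically be reviewed with:

root> show log jsrpd

The messages shown below are abbreviated; exact wording varies by Junos release.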

Node0 JSRPD

May 27 14:12:17 RG-0 hold timer, HOLD->SECONDARY
*/node0 booted and the state went to secondary

May 27 14:13:05 control link status changed from UP->DOWN
May 27 14:13:05 RG-0 secondary->PRIMARY due to Control & Fabric links down

*/node0 went to the primary state because the control link was down; node1 was also primary at this time => split brain

May 27 14:13:08 control link status changed from DOWN->UP
*/node0 remained primary because its RG0 priority (200) was higher than node1's (100).

Node1 JSRPD

May 27 14:03:32 RG-0 secondary->PRIMARY due to Control & Fabric links down
*/node0 was rebooted and node1's state went to primary.

May 27 14:13:05 control link status changed from UP->DOWN
*/Split brain occurred; both nodes were in the primary state at this time.

May 27 14:13:08 control link status changed from DOWN->UP
May 27 14:13:08 Both the nodes are primary. RG-0 PRIMARY->SECONDARY_HOLD due to preempt/yield, my priority 100 is worse than other node's priority 200

*/The control link came back up and node1 detected that node0 was also primary; node1 then changed to "secondary-hold" because its RG0 priority (100) was lower than node0's (200).
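
In this case the split brain resolved in node0's favor because of its higher RG0 priority. If node1 is preferred as primary again, RG0 mastership can be moved back manually once both nodes see each other; a sketch using standard chassis cluster operational commands (these are not steps stated in this article):

root> request chassis cluster failover redundancy-group 0 node 1
root> request chassis cluster failover reset redundancy-group 0

The first command moves RG0 mastership to node1; the second clears the manual failover flag that the first command sets.
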
Solution:
Two approaches to a solution are possible:
If neither of these approaches works, open a case with JTAC to investigate the possibility that:
  •   This could be a hardware issue: if the file system corruption persists even after re-imaging with the "-partition" option, the cause may be hardware, and an RMA would be the appropriate way to resolve it. (A re-imaging sketch follows this list.)
  •   This could be a software issue: it may be a new software defect that needs further investigation. (The steps in the Symptoms section above are helpful for reproducing the problem.)
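
For reference, re-imaging a branch SRX with repartitioning is typically done through the partition option of the software install command; a sketch, assuming the Junos package has already been copied to /var/tmp (the package name is illustrative, not taken from this article):

root> request system software add /var/tmp/junos-srxsme-12.1X46-D40-domestic.tgz no-copy no-validate partition

The partition option re-partitions and formats the internal media during the installation, which clears file system corruption; configuration and logs on that node are lost, so back them up first.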


