[SBR] How to start one half of a cluster when the two halves have become isolated from each other

Article ID: KB27037  |  Last Updated: 08 Mar 2017  |  Version: 4.0
Summary:
This article describes how to start one half of a cluster when the two halves have become isolated from each other.
Symptoms:
SBR Carrier Cluster Edition may encounter a split-brain scenario if the two halves of the cluster become isolated from each other due to external causes. Normally, one half of the cluster shuts itself down to avoid running in a split-brain state.

However, if the entire cluster stops functioning and cannot be restarted with the normal startup scripts, the procedure in this article recovers one half of the cluster so that SBR functionality resumes until full connectivity between the nodes is restored.
Cause:
 
Solution:
This solution applies when one half of the cluster still has network connectivity among its own nodes, but has lost the ability to communicate with the other half, and all nodes are offline.

Note: This procedure restores functionality of only one half of the cluster. Do not use this method to initialize both sides, as this will cause major issues when connectivity is restored.

The NDB cluster startup commands accept the --nowait-nodes switch, which initializes the cluster without waiting for all nodes to join. The switch takes a comma-separated list of the node IDs that are unreachable.
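
As a minimal syntax sketch (placeholder node IDs; the actual commands used in this article appear in the steps below), the switch is simply appended to the normal ndb_mgmd and ndbd startup commands, listing the management or data node IDs that should not be waited for:

    ndb_mgmd --nowait-nodes=<unreachable management node IDs> <other ndb_mgmd options>
    ndbd --nowait-nodes=<unreachable data node IDs> <other ndbd options>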

This scenario assumes a four-node cluster with two front end Management/SBR (sm) nodes and two back end Data (d) nodes, where one front end node and one back end node are unreachable:
  1. From the front end node, log on as hadm and issue the ndb_mgmd command to start the management node:
    -bash-3.00# su - hadm
    hadm@mzone-1:~> ndb_mgmd --ndb-nodeid=51 --nowait-nodes=52 --configdir=/opt/JNPRmysqld/data/
    MySQL Cluster Management Server mysql-5.1.56 ndb-7.1.15a
  2. From the reachable data node, issue the ndbd command with the proper connect string. Note that this is run as root:
    -bash-3.00# /opt/JNPRmysql/install/bin/ndbd --ndb-nodeid=1 --connect-string=nodeid=1,
    10.17.14.223:5235,10.17.14.224:5235 --nowait-nodes=2
    2013-03-06 17:17:00 [ndbd] INFO -- Angel connected to '10.17.14.223:5235'
    2013-03-06 17:17:00 [ndbd] INFO -- Angel allocated nodeid: 1
  3. From here, verify that the nodes are connected and that the data node has finished starting. If the data node is still in the starting state, SBR will fail to start. Check from the management node:

    The following state will not allow SBR to start:
    hadm@mzone-1:~> ndb_mgm -e show
    Connected to Management Server at: 10.17.14.223:5235
    Cluster Configuration
    ---------------------
    [ndbd(NDB)] 2 node(s)
    id=1 @10.17.14.225 (mysql-5.1.56 ndb-7.1.15, starting, Nodegroup: 0, Master)
    id=2 (not connected, accepting connect from 10.17.14.226)
    
    [ndb_mgmd(MGM)] 2 node(s)
    id=51 @10.17.14.223 (mysql-5.1.56 ndb-7.1.15)
    id=52 (not connected, accepting connect from 10.17.14.224)
    
    [mysqld(API)] 6 node(s)
    id=61 (not connected, accepting connect from 10.17.14.223)
    id=62 (not connected, accepting connect from 10.17.14.224)
    id=100 (not connected, accepting connect from 10.17.14.223)
    id=101 (not connected, accepting connect from 10.17.14.224)
    id=201 (not connected, accepting connect from 10.17.14.223)
    id=202 (not connected, accepting connect from 10.17.14.224)
    The following state is good:
    hadm@mzone-1:~> ndb_mgm -e show
    Connected to Management Server at: 10.17.14.223:5235
    Cluster Configuration
    ---------------------
    [ndbd(NDB)] 2 node(s)
    id=1 @10.17.14.225 (mysql-5.1.56 ndb-7.1.15, Nodegroup: 0, Master)
    id=2 (not connected, accepting connect from 10.17.14.226)
    
    [ndb_mgmd(MGM)] 2 node(s)
    id=51 @10.17.14.223 (mysql-5.1.56 ndb-7.1.15)
    id=52 (not connected, accepting connect from 10.17.14.224)
    
    [mysqld(API)] 6 node(s)
    id=61 (not connected, accepting connect from 10.17.14.223)
    id=62 (not connected, accepting connect from 10.17.14.224)
    id=100 (not connected, accepting connect from 10.17.14.223)
    id=101 (not connected, accepting connect from 10.17.14.224)
    id=201 (not connected, accepting connect from 10.17.14.223)
    id=202 (not connected, accepting connect from 10.17.14.224)
    
    hadm@mzone-1:~>
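    If the data node remains in the starting state, its progress can also be polled directly from the management node (a minimal sketch; node ID 1 is the data node started in step 2):
    hadm@mzone-1:~> ndb_mgm -e "1 status"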
  4. Start SBR as root from the Radius directory:
    -bash-3.00# ./sbrd start
Verify that the cluster is running with 50% of the nodes connected:
-bash-3.00# ./sbrd status

---------------------------------------------------------------------------
SBR 7.31.23883 cluster mzone{0s,2sm,0m,2d}
on SunOS 5.10 Generic_142900-04 node mzone-1(sm)
---------------------------------------------------------------------------

Connected to Management Server at: 10.17.14.223:5235

[ndbd(NDB)] 2 node(s)
id=1 @10.17.14.225 (mysql-5.1.56 ndb-7.1.15, Nodegroup: 0, Master)
id=2 (not connected, accepting connect from 10.17.14.226)

[ndb_mgmd(MGM)] 2 node(s)
id=51 @10.17.14.223 (mysql-5.1.56 ndb-7.1.15)
id=52 (not connected, accepting connect from 10.17.14.224)

[mysqld(API)] 6 node(s)
id=61 @10.17.14.223 (mysql-5.1.56 ndb-7.1.15)
id=62 (not connected, accepting connect from 10.17.14.224)
id=100 @10.17.14.223 (mysql-5.1.56 ndb-7.1.15)
id=101 (not connected, accepting connect from 10.17.14.224)

10.17.14.223.1646 Idle
10.17.14.223.1813 Idle
10.17.14.223.1645 Idle
10.17.14.223.1812 Idle
*.1813 *.* 0 0 49152 0 LISTEN
*.1812 *.* 0 0 49152 0 LISTEN

hadm 14193 /opt/JNPRmysql/install/bin/mysqld --basedir=/opt/JNPRmysql/install --datadir=/o
hadm 13951 ndb_mgmd --ndb-nodeid=51 --nowait-nodes=52 --configdir=/opt/JNPRmysqld/data/
hadm 14105 /bin/sh /opt/JNPRmysql/install/bin/mysqld_safe
root 14194 radius sbr.xml


-bash-3.00#
When network connectivity between the two halves of the cluster is restored, start the remaining nodes with the normal startup scripts.
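For example, on each node in the previously isolated half (a minimal sketch, assuming the same Radius directory layout as above and that the normal startup script manages the NDB and RADIUS processes for that node type):

-bash-3.00# ./sbrd start
-bash-3.00# ./sbrd status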