This article provides a step-by-step troubleshooting guide for High Availability (HA) in NSM 2007.2 and above. HA in these releases introduces real-time, database-layer replication between the two HA nodes, combined with traditional RSYNC/SSH replication for non-database files.
Problem or Goal:
The goal is to help users troubleshoot HA on NSM.
The steps below are listed in the order that events occur on the HA system. This will help determine the exact step at which the problem occurs.
Verify HA status for proper HA heartbeat communication and database replication status
Check that both the highAvail and highAvailSvr services are started correctly. A timed-out (error) status indicates that the services are not started on the other node or that heartbeat communication has failed.
Observe database replication status of "dirty" or "in-sync" for later steps.
Check HA configuration parameters of HaSvr/var/haSvr.cfg
In /usr/netscreen/HaSvr/var/haSvr.cfg, the highAvail.isHaEnabled should be y on both nodes
Confirm that the IP address configuration is correct for the primary/secondary server.
Check that highAvail.HMACKey matches on both nodes (for shared secret and heartbeat to work correctly).
Verify the presence and functionality of the SSH and RSYNC binaries on the system.
Confirm that the pingable IP is reachable by the local server (this is required for services to start).
Note down highAvail.rsyncUser for later verification.
Recommendation: Set highAvail.writetoLog to y for better troubleshooting.
Optional: Verify the shared disk configuration.
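The parameter checks above can be partly scripted. The following is a minimal sketch, not an NSM utility: the function names cfg_get and cfg_same are hypothetical, and it assumes that haSvr.cfg holds one "key value" or "key=value" pair per line, with comment lines starting with #. Adjust the parsing if your file uses a different layout.

```shell
# Print the value of KEY from a cfg file.
# Assumption: one "key value" or "key=value" pair per line; lines that
# start with '#' are comments. This layout is not confirmed by the article.
cfg_get() (
    file=$1
    key=$2
    awk -v k="$key" '
        /^[ \t]*#/ { next }                  # skip comment lines
        {
            line = $0
            sub(/^[ \t]+/, "", line)         # trim leading whitespace
            n = match(line, /[ \t=]+/)       # first separator run
            if (n == 0) next
            name = substr(line, 1, n - 1)
            value = substr(line, n + RLENGTH)
            sub(/[ \t]+$/, "", value)        # trim trailing whitespace
            if (name == k) { print value; exit }
        }
    ' "$file"
)

# Return 0 when KEY is set and identical in the local cfg file and in a
# copy of the peer node's cfg file.
cfg_same() (
    local_cfg=$1
    peer_cfg=$2
    key=$3
    a=$(cfg_get "$local_cfg" "$key")
    b=$(cfg_get "$peer_cfg" "$key")
    [ -n "$a" ] && [ "$a" = "$b" ]
)
```

For example, after copying the peer's file to a temporary location (the path /tmp/peer-haSvr.cfg is only an illustration), cfg_same /usr/netscreen/HaSvr/var/haSvr.cfg /tmp/peer-haSvr.cfg highAvail.HMACKey returns 0 only when the shared secret matches on both nodes.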
Confirm that both HA servers are running with the same root or non-root user ID for the NSM processes
Verify guiSvr.setuid in the configuration file for /usr/netscreen/GuiSvr/var/guiSvr.cfg. If non-root operation is needed, run /usr/netscreen/GuiSvr/utils/setperms.sh.
Verify devSvr.setuid in the configuration file for /usr/netscreen/DevSvr/var/devSvr.cfg. If non-root operation is needed, run /usr/netscreen/DevSvr/utils/setperms.sh.
Verify guiSvr.setuid in the configuration file for /usr/netscreen/HaSvr/var/haSvr.cfg. If non-root operation is needed, run /usr/netscreen/HaSvr/utils/setperms.sh.
Test the SSH trust relationship between both servers
From the primary server, SSH to the secondary server while logged in to the account defined for highAvail.rsyncUser. The login should not prompt for a password if the SSH keys are correctly installed. Repeat this process from the secondary server to the primary server.
If the trust relation is not working, generate new RSA keys using ssh-keygen -t rsa and copy the id_rsa.pub to ~/.ssh/authorized_keys on the remote server. Repeat this process for each node.
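The password-less login test can also be run non-interactively. This is a sketch under stated assumptions: check_ssh_trust is a hypothetical helper name, and the user and host are supplied by you. It relies on the standard OpenSSH options BatchMode (fail instead of prompting for a password) and ConnectTimeout.

```shell
# Return 0 if USER can log in to HOST with SSH keys only.
# BatchMode=yes makes ssh fail rather than prompt for a password;
# ConnectTimeout=10 bounds the wait if the peer is unreachable.
check_ssh_trust() (
    user=$1
    host=$2
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$user@$host" true >/dev/null 2>&1
)
```

Run it on each node while logged in as the highAvail.rsyncUser account, passing that account name and the peer's address. A non-zero return means that key-based login failed or the peer could not be reached within the timeout.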
Verify that initial RSYNC replication for non-database files is working correctly
Check the content of /usr/netscreen/HaSvr/bin/.haDoDirect.result for a SUCCESS or FAIL message. FAIL indicates that the SSH trust relationship or RSYNC has failed, typically because of missing SSH keys or incorrect OS permissions for the NSM user.
With the HA process stopped on the remote node, HaSvr/bin/.haDoDirect can be run manually as the NSM user to test SSH/RSYNC replication of non-database files to the remote server.
Check the HaSvr/var/errorLog/ha.log file for SSH/RSYNC errors.
Optionally, if you are unable to identify the failure, run the SSH/RSYNC command directly from the ha.log file.
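The result-file and log checks above could be wrapped in two small helpers. This is a rough sketch: the function names are hypothetical, and it assumes that the result file contains the literal word SUCCESS or FAIL and that relevant ha.log lines mention ssh or rsync by name.

```shell
# Print SUCCESS, FAIL, or UNKNOWN based on the contents of the result file.
# Assumption: the file contains the literal word SUCCESS or FAIL.
hadodirect_status() (
    result_file=$1
    if [ ! -r "$result_file" ]; then
        echo UNKNOWN
    elif grep -q 'FAIL' "$result_file"; then
        echo FAIL
    elif grep -q 'SUCCESS' "$result_file"; then
        echo SUCCESS
    else
        echo UNKNOWN
    fi
)

# Print up to the last 20 log lines that mention ssh or rsync
# (case-insensitive), so the failing command is easy to spot.
recent_sync_lines() (
    log_file=$1
    grep -i -E 'ssh|rsync' "$log_file" | tail -n 20
)
```

For example, hadodirect_status /usr/netscreen/HaSvr/bin/.haDoDirect.result prints a single word that can be checked before looking through ha.log with recent_sync_lines.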
From this point on: The HA heartbeat and the replication of non-database files using SSH/RSYNC should be functional. The next steps cover troubleshooting of the database-layer replication, in the order in which events occur during a successful replication.
Verify primary and secondary guiSvrManager local HA link with highAvail process for proper startup
Verify that GuiSvr/var/errorLog/guiDaemon.0 contains the statement connectionMgr connected to ha-link. This indicates that guiSvr has a working local connection to the highAvail process, which controls failover and database replication. This message is logged in debug mode only.
If the message getXdbObjById(domainId(0), category(shadow_server), id(2), snapshotId(0)) failed is shown, the database container shadow_server is missing the record for ID 2 of type peerGuiSvr with the one-time password. An extended HA environment would require DevSvr installation to insert the database record for shadow_server ID 2.
Confirm that the guiDaemon.0 log correctly reports the local IP and peer IP from haSvr.cfg, as follows:
Check if the role is primary server or secondary server
If only the primary server is running (no other server is running), the server should take the role of the master with get BDB event DB_EVENT_REP_MASTER shown in the guiDaemon.0 log file.
In the case of secondary server status, guiDaemon.0 will show get BDB event DB_EVENT_REP_CLIENT, indicating it has received the role of a secondary server from the highAvail process.
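The role check can be automated by extracting the replication events from the log. The sketch below uses hypothetical helper names and assumes only that the event names appear in guiDaemon.0 exactly as quoted above (DB_EVENT_REP_MASTER, DB_EVENT_REP_CLIENT, and so on).

```shell
# Print every BDB replication event name in the log, one per line, in log order.
bdb_events() (
    log_file=$1
    LC_ALL=C sed -n 's/.*\(DB_EVENT_REP_[A-Z_]*\).*/\1/p' "$log_file"
)

# Print the most recent BDB replication event name (empty if none is logged).
last_bdb_event() (
    bdb_events "$1" | tail -n 1
)

# Print master, client, or unknown, based on the last role-assigning event.
# Only MASTER and CLIENT events change the role; other events are ignored.
ha_role() (
    role=unknown
    for ev in $(bdb_events "$1"); do
        case $ev in
            DB_EVENT_REP_MASTER) role=master ;;
            DB_EVENT_REP_CLIENT) role=client ;;
        esac
    done
    echo "$role"
)
```

Running ha_role against GuiSvr/var/errorLog/guiDaemon.0 on each node should report master on one node and client on the other. last_bdb_event shows how far the startup sequence has progressed.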
Verify secondary guiSvrManager startup and replication status
The get BDB event DB_EVENT_REP_NEWMASTER is shown on the secondary server when it is successfully connected to the primary server and the database replication is starting. When replication is complete, get BDB event DB_EVENT_REP_STARTUPDONE will be displayed. Note: In the case of a first-time replication or when databases have diverged and need to be re-synchronized, this operation could take some time depending on the network speed and database size.
No progress is displayed in the guiDaemon.0 log file, but files in /var/netscreen/GuiSvr/xdb/data should be created and should grow as containers are replicated from the primary server. Monitor with ls -ltr, which lists the most recently modified files last.
Monitor the transaction log files in GuiSvr/xdb/log on both the primary and secondary servers. If the files are the same on both, replication is most likely complete. Log in with a GUI client; DB_EVENT_REP_STARTUPDONE should appear on the secondary server once replication has completed.
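Because the log shows no progress, a compact directory summary can help track replication. The helper below is an illustrative sketch (dir_summary is a hypothetical name). It prints the number of regular files and their total size, so repeated runs on the secondary show whether the data directory is still growing, and runs on both nodes give a rough comparison of the transaction log directories.

```shell
# Print "<file count> <total bytes>" for the regular files directly inside DIR.
# Subdirectories and hidden files are not counted.
dir_summary() (
    dir=$1
    count=0
    total=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        size=$(wc -c < "$f" | tr -d ' ')
        count=$((count + 1))
        total=$((total + size))
    done
    echo "$count $total"
)
```

Matching output for the log directory on both nodes is only an indication that replication has caught up; the DB_EVENT_REP_STARTUPDONE message remains the confirmation.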
If unable to correctly replicate the database or DB_PANIC is displayed on the secondary guiDaemon log
If the database is unable to sync back with the primary, an automatic process is usually started to re-sync. This is done by the .haDoDirect script at startup. However, if the automatic process to re-sync is not started, a manual restart may need to be performed.
To restart the process:
Stop the HaSvr process on the secondary server.
Delete every file except DB_CONFIG in /var/netscreen/GuiSvr/xdb/data. Delete files in GuiSvr/xdb/init and GuiSvr/xdb/log, making sure to leave the directories empty. This will force the server to re-sync a new database from the primary server during the next startup.
Try to start the HaSvr process again and observe if replication is occurring.
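Since the cleanup step is destructive, it can help to list what would be removed before deleting anything. The following sketch only prints candidate files and removes nothing. The function name is hypothetical, and the three directories are passed as arguments so that you supply the paths for your installation.

```shell
# Print the files the re-sync procedure would delete; nothing is removed here.
# DB_CONFIG in the data directory is kept; every regular file in the init
# and log directories is listed.
list_resync_targets() (
    data_dir=$1
    init_dir=$2
    log_dir=$3
    for f in "$data_dir"/*; do
        [ -f "$f" ] || continue
        if [ "$(basename "$f")" != "DB_CONFIG" ]; then
            echo "$f"
        fi
    done
    for d in "$init_dir" "$log_dir"; do
        for f in "$d"/*; do
            if [ -f "$f" ]; then
                echo "$f"
            fi
        done
    done
)
```

Review the printed list on the secondary server with the HaSvr process stopped, and delete the files only after confirming that DB_CONFIG is not among them and that the directories themselves will remain in place.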
Ensure proper failover usage is followed to avoid diverged databases
Note: DB_PANIC is shown in GuiSvr/var/errorLog/guiDaemon.0 on the standby-node.
Another server can be given the primary role, but only through a normal failover, when the HA process for both servers is running, and you stop the HA services on the primary.
If services are started/stopped to let a node become a new primary server without an HA failover, the database will build a diverged log history from the other node. The new log will no longer be able to sync after the other node is started in standby mode.
Check for ourRsaPrivateKey message in guiDaemon.0 on the secondary server
If a message similar to "ourRsaPrivateKey is missing in /usr/netscreen/GuiSvr/var/guiSvr.cfg!" is repeated in /var/netscreen/GuiSvr/errorLog/guiDaemon.0 on the standby system, the clientOneTimePassword in guiSvr.cfg on the standby does not match the value stored in the database on the active NSM server. To correct this:
1. Stop both NSM servers:
2. Check the value in the NSM database on the Active server:
b. Open in read only mode.
c. Select Option 4.
d. Enter: 0.shadow_server.2.
e. View and make note of the client one-time password in the shadow_server table. For example:
3. Change guiSvr.cfg on both systems to match what is shown in the NSM database.
4. Restart the NSM processes (Active system, then standby):