This article is a step-by-step troubleshooting guide for the High Availability (HA) feature in NSM 2007.2 and later. HA replication in these releases combines real-time replication at the database layer between the two HA nodes with traditional RSYNC/SSH replication for non-database files.
Problem or Goal:
The steps below are listed in the order in which events occur on the HA system. Following them in order helps determine the exact step at which the problem occurs.
Verify HA status for proper HA heartbeat communication and database replication status.
Check that both the highAvail and highAvailSvr services have started correctly. A timed-out (error) status indicates that the services are not started on the other node or that heartbeat communication has failed.
Note whether the database replication status is "dirty" or "in-sync"; this is needed in later steps.
Check HA configuration parameters of HaSvr/var/haSvr.cfg
In /usr/netscreen/HaSvr/var/haSvr.cfg, highAvail.isHaEnabled should be "y" on both nodes
Confirm the IP address configuration is correct for primary/secondary
Check that highAvail.HMACKey matches on both nodes; the shared secret must be identical for the heartbeat to work correctly
Verify the presence and functionality of SSH and RSYNC binaries on the system
Confirm the pingable IP is reachable by the local server; this is required for services to start
Note down highAvail.rsyncUser for later verification
Recommendation: set "highAvail.writetoLog" to "y" for better troubleshooting
Optional: verify shared disk configuration
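The parameter checks above can be scripted. The following is a minimal sketch, assuming haSvr.cfg stores one parameter per line as a name followed by whitespace and its value (verify this against your own file). The helper names cfg_get and cfg_same are hypothetical, and the peer's file is assumed to have been copied to a local path such as /tmp/haSvr.cfg.peer.

```shell
#!/bin/sh
# Hypothetical helpers; assumes "name<whitespace>value" lines in the .cfg file.

# Print the value of parameter $2 from config file $1 (empty if absent).
cfg_get() {
    awk -v key="$2" '$1 == key { $1 = ""; sub(/^[ \t]+/, ""); print; exit }' "$1"
}

# Succeed only if parameter $1 is set and identical in files $2 and $3.
cfg_same() {
    a=$(cfg_get "$2" "$1")
    b=$(cfg_get "$3" "$1")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example usage (the /tmp path is an assumption):
#   cfg_get /usr/netscreen/HaSvr/var/haSvr.cfg highAvail.isHaEnabled
#   cfg_same highAvail.HMACKey /usr/netscreen/HaSvr/var/haSvr.cfg /tmp/haSvr.cfg.peer
```

A non-zero exit status from cfg_same suggests that the parameter is missing or differs between the two nodes.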
Confirm both HA servers are running with the same root or non-root user id for the NSM processes
Verify guiSvr.setuid in the configuration file /usr/netscreen/GuiSvr/var/guiSvr.cfg. If non-root operation is needed, run /usr/netscreen/GuiSvr/utils/setperms.sh
Verify devSvr.setuid in the configuration file /usr/netscreen/DevSvr/var/devSvr.cfg. If non-root operation is needed, run /usr/netscreen/DevSvr/utils/setperms.sh
Verify guiSvr.setuid in the configuration file /usr/netscreen/HaSvr/var/haSvr.cfg. If non-root operation is needed, run /usr/netscreen/HaSvr/utils/setperms.sh
Test the SSH trust relationship between both servers
From the primary server, SSH to the secondary server while logged in to the account defined by highAvail.rsyncUser. The login should not prompt for a password if the SSH keys are correctly installed. Repeat the process from the secondary server to the primary server.
If the trust relationship is not working, generate new RSA keys using ssh-keygen -t rsa and append the contents of id_rsa.pub to ~/.ssh/authorized_keys on the remote server. Repeat the process for each node.
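The password-less login test can be made non-interactive. The sketch below relies on the OpenSSH client options BatchMode and ConnectTimeout; the function name check_ssh_trust, the user "nsm" and the host "peer-node" are placeholders, not values taken from NSM.

```shell
#!/bin/sh
# Succeeds only if $1@$2 accepts key-based login without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
check_ssh_trust() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$1@$2" true
}

# Example usage, run as the highAvail.rsyncUser account on each node
# ("nsm" and "peer-node" are placeholders):
#   check_ssh_trust nsm peer-node && echo "trust OK" || echo "trust FAILED"
```

Run the check in both directions, since the trust relationship has to work from primary to secondary and from secondary to primary.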
Verify initial RSYNC replication for non-database files is working correctly
Check the content of /usr/netscreen/HaSvr/bin/.haDoDirect.result for a "SUCCESS" or "FAIL" message. FAIL indicates that the SSH trust relationship or RSYNC has failed, typically because of missing SSH keys or incorrect OS permissions for the NSM user.
With the HA process stopped on the remote node, HaSvr/bin/.haDoDirect can be run manually as the nsm user to test SSH/RSYNC replication of non-database files to the remote server.
Check the HaSvr/var/errorLog/ha.log file for SSH/RSYNC errors
Optionally, if unable to identify the failure, run the SSH/RSYNC command from the ha.log file directly
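The result-file and log checks above might be wrapped as follows. This is a sketch only: it assumes the result file contains the literal word SUCCESS or FAIL, and that relevant ha.log lines mention "ssh" or "rsync"; the function names are hypothetical.

```shell
#!/bin/sh
# Report the outcome recorded in a .haDoDirect.result file ($1).
# Prints SUCCESS, FAIL, or UNKNOWN; returns 0 only for SUCCESS.
ha_direct_status() {
    if grep "FAIL" "$1" >/dev/null 2>&1; then
        echo "FAIL"; return 1
    elif grep "SUCCESS" "$1" >/dev/null 2>&1; then
        echo "SUCCESS"; return 0
    fi
    echo "UNKNOWN"; return 2
}

# Show the last 20 lines of an ha.log file ($1) that mention ssh or rsync.
ha_log_errors() {
    grep -i -E "ssh|rsync" "$1" | tail -n 20
}

# Example usage:
#   ha_direct_status /usr/netscreen/HaSvr/bin/.haDoDirect.result
#   ha_log_errors /usr/netscreen/HaSvr/var/errorLog/ha.log
```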
At this point, the HA heartbeat and the SSH/RSYNC replication of non-database files should be functional. The following steps cover troubleshooting of the database layer replication, in the order in which events occur during a successful replication.
Verify that guiSvrManager on both the primary and the secondary has established its local HA link with the highAvail process at startup.
Verify that GuiSvr/var/errorLog/guiDaemon.0 contains the statement "connectionMgr connected to ha-link". This indicates a correct local connection from guiSvr to the highAvail process, which is used to control failover and database replication. The statement is logged in debug mode only.
If the message "getXdbObjById(domainId(0), category(shadow_server), id(2), snapshotId(0)) failed" is shown, the shadow_server database container is missing the record with ID 2, of type peerGuiSvr, which holds the one-time password. In an extended HA environment, the DevSvr installation is required to insert the database record for shadow_server ID 2.
Confirm that the guiDaemon.0 log correctly reports the local IP and peer IP configured in haSvr.cfg.
Check if the role is primary server or secondary server
If only the primary server is running (no other server is up), it should take the master role, and "get BDB event DB_EVENT_REP_MASTER" is shown in the guiDaemon.0 log file.
On a secondary server, guiDaemon.0 shows "get BDB event DB_EVENT_REP_CLIENT", indicating that it has received the secondary role from the highAvail process.
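The role can be read from the log mechanically by looking for the most recent of the two role events named above. The following is a sketch; ha_db_role is a hypothetical helper, and it assumes the event names appear verbatim in guiDaemon.0.

```shell
#!/bin/sh
# Print "primary", "secondary" or "unknown" based on the most recent
# DB_EVENT_REP_MASTER / DB_EVENT_REP_CLIENT event in a guiDaemon log ($1).
ha_db_role() {
    event=$(sed -n \
        -e 's/.*\(DB_EVENT_REP_MASTER\).*/\1/p' \
        -e 's/.*\(DB_EVENT_REP_CLIENT\).*/\1/p' "$1" | tail -n 1)
    case "$event" in
        DB_EVENT_REP_MASTER) echo "primary" ;;
        DB_EVENT_REP_CLIENT) echo "secondary" ;;
        *) echo "unknown" ;;
    esac
}

# Example usage:
#   ha_db_role /var/netscreen/GuiSvr/errorLog/guiDaemon.0
```

Note that DB_EVENT_REP_NEWMASTER is deliberately not matched here, because it reports the state of the connection to the primary rather than the role of the local node.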
Verify secondary guiSvrManager startup and replication status
"get BDB event DB_EVENT_REP_NEWMASTER" is shown on the secondary when it has successfully connected to the primary server and database replication is starting. When replication is complete, "get BDB event DB_EVENT_REP_STARTUPDONE" is displayed.
Note: For a first-time replication, or when the databases have diverged and need to be re-synced, this operation can take some time depending on network speed and database size.
No progress is displayed in the guiDaemon.0 log file, but files in /var/netscreen/GuiSvr/xdb/data should be created and grow in size as containers are replicated from the primary server. Monitor with "ls -ltr" to sort by modification time, most recent last.
Monitor the transaction log files in GuiSvr/xdb/log on both the primary and the secondary; if the files are the same, replication is most likely complete. Log in with a GUI client, and DB_EVENT_REP_STARTUPDONE should appear on the secondary if replication has completed.
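Because the log shows no progress during replication, one way to watch it is to sample the size of the data directory until the completion event appears. This is a minimal sketch; dir_bytes and startup_done are hypothetical helpers, and the 30-second interval is arbitrary.

```shell
#!/bin/sh
# Sum the sizes (in bytes) of regular files directly inside directory $1.
dir_bytes() {
    ls -ln "$1" | awk '$1 ~ /^-/ { s += $5 } END { print s + 0 }'
}

# Succeed once the log ($1) contains the replication-complete event.
startup_done() {
    grep "DB_EVENT_REP_STARTUPDONE" "$1" >/dev/null 2>&1
}

# Example usage on the secondary: print the data size every 30 seconds
# until replication is reported complete.
#   while ! startup_done /var/netscreen/GuiSvr/errorLog/guiDaemon.0; do
#       dir_bytes /var/netscreen/GuiSvr/xdb/data
#       sleep 30
#   done
```

A byte count that keeps growing between samples suggests that containers are still being copied from the primary.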
If the database does not replicate correctly, or DB_PANIC is displayed in the guiDaemon log on the secondary, check the following:
If the database is unable to sync back with the primary, an automatic re-sync process is normally started by the .haDoDirect script at startup time. If this does not happen, a manual restart may be needed.
To restart the process:
Stop the HaSvr process on the secondary
Delete every file except DB_CONFIG in /var/netscreen/GuiSvr/xdb/data. Delete files in GuiSvr/xdb/init and GuiSvr/xdb/log making sure to leave the directories empty.
This will force the server to re-sync a new database from the primary during the next startup
Try to start the HaSvr process again and observe if replication is occurring.
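Since the manual re-sync deletes database files, it can help to list what would be removed before deleting anything. The sketch below is a dry run only and removes nothing; resync_candidates is a hypothetical helper that takes the xdb directory as its argument.

```shell
#!/bin/sh
# Print the files that the manual re-sync procedure would delete under the
# xdb directory $1: everything in data/ except DB_CONFIG, plus all files
# in init/ and log/. Nothing is removed; this is a dry run.
resync_candidates() {
    find "$1/data" -type f ! -name DB_CONFIG
    find "$1/init" "$1/log" -type f
}

# Example usage on the secondary, with the HaSvr process stopped:
#   resync_candidates /var/netscreen/GuiSvr/xdb
```

Review the list on the secondary only, and confirm that DB_CONFIG is absent from it, before performing the deletion by hand.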
Ensure proper failover usage is followed in order to avoid diverged databases
Note: DB_PANIC message is shown in GuiSvr/var/errorLog/guiDaemon.0 on the standby-node.
To hand the primary role to the other server, make sure this happens through a normal failover: with the HA processes running on both servers, stop HA services on the primary.
If services are started and stopped so that a node becomes the new primary without an HA failover, its database builds a log history that diverges from that of the other node. The logs can then no longer be synced when the other node is started in standby mode.
Check for ourRsaPrivateKey message in guiDaemon.0 on the secondary server.
If a message similar to "ourRsaPrivateKey is missing in /usr/netscreen/GuiSvr/var/guiSvr.cfg!" is repeated in /var/netscreen/GuiSvr/errorLog/guiDaemon.0 on the standby system, the clientOneTimePassword in guiSvr.cfg on the standby does not match the value in the database on the active NSM server. To correct this:
1. Stop both NSM servers
2. Check the value in the NSM database on the Active server:
b. Open in read only mode.
c. Select Option 4
d. Type: 0.shadow_server.2
e. View and make note of the client one-time password in the shadow_server table. For example:
3. Change guiSvr.cfg on both systems to match what is shown in the NSM database.
4. Restart NSM processes (Active system, then standby)
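To confirm that the procedure worked, the number of ourRsaPrivateKey messages in the standby log can be compared before and after the restart. A minimal sketch, with count_msg as a hypothetical helper:

```shell
#!/bin/sh
# Count the lines in log file $1 that contain the fixed string $2.
# "|| true" keeps the exit status at 0 when the count is zero.
count_msg() {
    grep -F -c -- "$2" "$1" || true
}

# Example usage on the standby system:
#   count_msg /var/netscreen/GuiSvr/errorLog/guiDaemon.0 "ourRsaPrivateKey is missing"
```

If the count stops increasing after both systems have been restarted, the one-time password mismatch has most likely been resolved.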