[EOL/EOE] [NSM] High Availability (HA) Troubleshooting for NSM 2007.2 and above



Article ID: KB11264 KB Last Updated: 24 Mar 2021Version: 10.0

This article provides a step-by-step troubleshooting guide for High Availability (HA) in NSM 2007.2 and above. HA replication in these releases introduces real-time, database-layer replication between the two HA nodes, combined with traditional RSYNC/SSH replication for non-database files.

Note: A product listed in this article has either reached hardware End of Life (EOL) OR software End of Engineering (EOE). 
Refer to End of Life Products & Milestones for the EOL, EOE, and End of Support (EOS) dates.

The goal is to help users troubleshoot HA on NSM.


The steps below are listed in the order that events occur on the HA system. This will help determine the exact step at which the problem occurs.

1. Verify HA status for proper HA heartbeat communication and database replication status

  1. Run /usr/netscreen/HaSvr/utils/haStatus.

  2. Check that both the highAvail and highAvailSvr services are started correctly. A timed-out (error) status indicates that the services are not started on the other node or that there is a heartbeat communication failure.

  3. Note the database replication status ("dirty" or "in-sync") for use in later steps.
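As a quick sketch, the replication state can be extracted from captured haStatus output. The sample output format below is an assumption for illustration only, not verbatim tool output:

```shell
# Hypothetical haStatus output captured to a file (format is illustrative only)
cat > /tmp/haStatus.sample <<'EOF'
highAvail (pid 1234) is running
highAvailSvr (pid 1240) is running
Replication status: in-sync
EOF

# Flag a "dirty" database so the later re-sync steps can be anticipated
if grep -q 'Replication status: in-sync' /tmp/haStatus.sample; then
    echo "replication OK"
else
    echo "replication DIRTY - see steps 8 and 9"
fi
```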

2. Check HA configuration parameters of HaSvr/var/haSvr.cfg

  1. In /usr/netscreen/HaSvr/var/haSvr.cfg, highAvail.isHaEnabled should be set to y on both nodes.

  2. Confirm that the IP address configuration is correct for the primary/secondary server.

  3. Check that highAvail.HMACKey matches on both nodes (for shared secret and heartbeat to work correctly).

  4. Verify the presence and functionality of the SSH and RSYNC binaries on the system.

  5. Confirm that the pingable IP is reachable by the local server (this is required for services to start).

  6. Note down highAvail.rsyncUser for later verification.

  7. Recommendation: Set highAvail.writetoLog to y for better troubleshooting.

  8. Optional: Verify the shared disk configuration.
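The checks above can be scripted against a copy of haSvr.cfg. The excerpt below is hypothetical: the highAvail.isHaEnabled, highAvail.HMACKey, highAvail.rsyncUser, and highAvail.writetoLog keys come from this article, while the IP parameter names and all values are illustrative assumptions:

```shell
# Hypothetical excerpt of /usr/netscreen/HaSvr/var/haSvr.cfg (values are examples;
# the two IP parameter names are assumptions for illustration)
cfg=/tmp/haSvr.cfg.sample
cat > "$cfg" <<'EOF'
highAvail.isHaEnabled y
highAvail.primaryIp 10.0.0.1
highAvail.secondaryIp 10.0.0.2
highAvail.HMACKey netscreen
highAvail.rsyncUser nsm
highAvail.writetoLog y
EOF

# Confirm HA is enabled and logging is on; print the rsync user for step 4
grep -q '^highAvail.isHaEnabled y' "$cfg" && echo "HA enabled"
grep -q '^highAvail.writetoLog y'  "$cfg" && echo "HA logging enabled"
awk '/^highAvail.rsyncUser/ {print "rsync user:", $2}' "$cfg"
```

On a real node, point cfg at the actual file and compare the HMACKey output from both nodes.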

3. Confirm that both HA servers run the NSM processes under the same user ID (root or non-root)

  1. Verify guiSvr.setuid in the configuration file /usr/netscreen/GuiSvr/var/guiSvr.cfg.
    If non-root operation is needed, run /usr/netscreen/GuiSvr/utils/

  2. Verify devSvr.setuid in the configuration file /usr/netscreen/DevSvr/var/devSvr.cfg.
    If non-root operation is needed, run /usr/netscreen/DevSvr/utils/

  3. Verify haSvr.setuid in the configuration file /usr/netscreen/HaSvr/var/haSvr.cfg.
    If non-root operation is needed, run /usr/netscreen/HaSvr/utils/
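A consistency check across the three configs can be sketched as below. The file contents and the haSvr.setuid key name are assumptions for illustration; substitute the real config paths on an actual server:

```shell
# Hypothetical copies of the three server configs (paths, key names, and the
# "nsm" user are illustrative assumptions)
mkdir -p /tmp/nsmcfg
echo 'guiSvr.setuid nsm' > /tmp/nsmcfg/guiSvr.cfg
echo 'devSvr.setuid nsm' > /tmp/nsmcfg/devSvr.cfg
echo 'haSvr.setuid nsm'  > /tmp/nsmcfg/haSvr.cfg

# All three processes should run under the same user ID
users=$(awk '{print $2}' /tmp/nsmcfg/*.cfg | sort -u)
if [ "$(echo "$users" | wc -l)" -eq 1 ]; then
    echo "setuid consistent: $users"
else
    echo "setuid MISMATCH: $users"
fi
```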

4. Test the SSH trust relationship between both servers

From the primary server, SSH to the secondary server while logged into the account defined for highAvail.rsyncUser. The login should not prompt for a password if the SSH keys are correctly installed. Repeat this process from the secondary server to the primary server.

If the trust relationship is not working, generate new RSA keys using ssh-keygen -t rsa and copy the public key to ~/.ssh/authorized_keys on the remote server. Repeat this process for each node.
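A minimal sketch of the key generation, using a scratch path so it can be rehearsed safely (on a real system, run ssh-keygen as the highAvail.rsyncUser account and use the default ~/.ssh paths):

```shell
# Generate a passphrase-less RSA key pair in a scratch location (illustrative)
rm -f /tmp/ha_test_key /tmp/ha_test_key.pub
ssh-keygen -t rsa -N '' -q -f /tmp/ha_test_key

# On a real node, the public key would then be appended to
# ~/.ssh/authorized_keys for the rsync user on the peer, for example:
#   cat ~/.ssh/id_rsa.pub | ssh peer 'cat >> ~/.ssh/authorized_keys'
ls -l /tmp/ha_test_key /tmp/ha_test_key.pub
```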

5. Verify that initial RSYNC replication for non-database files is working correctly

  1. Check the content of /usr/netscreen/HaSvr/bin/.haDoDirect.result for a SUCCESS or FAIL message. A FAIL result indicates that the SSH trust relationship or RSYNC replication has failed, typically due to missing SSH keys or incorrect OS permissions for the NSM user.

    With the remote-node HA process stopped, HaSvr/bin/.haDoDirect can be run manually under the NSM user to test SSH/RSYNC replication of non-database files to the remote server.

  2. Check the HaSvr/var/errorLog/ha.log file for SSH/RSYNC errors.

  3. Optionally, if you are unable to identify the failure, run the SSH/RSYNC command directly from the ha.log file.

  4. From this point on, the HA heartbeat and the SSH/RSYNC replication of non-database files should be functional. The next steps cover troubleshooting of the database-layer replication, in the order of events for a successful replication.
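The result-file check can be sketched as below; the single-word SUCCESS/FAIL content is assumed from this article's description, and the scratch path stands in for /usr/netscreen/HaSvr/bin/.haDoDirect.result:

```shell
# Hypothetical copy of the replication result file (content format assumed)
echo 'FAIL' > /tmp/haDoDirect.result

case "$(cat /tmp/haDoDirect.result)" in
    SUCCESS) echo "non-database replication OK" ;;
    FAIL)    echo "check SSH trust (step 4) and ha.log for RSYNC errors" ;;
esac
```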

6. Verify primary and secondary guiSvrManager local HA link with highAvail process for proper startup

  1. Verify that GuiSvr/var/errorLog/guiDaemon.0 contains the statement connectionMgr connected to ha-link (this message appears in debug mode only). It indicates that the guiSvr has a correct local connection to the highAvail process for controlling failover and database replication.

    If the message getXdbObjById(domainId(0), category(shadow_server), id(2), snapshotId(0)) failed is shown, the shadow_server database container is missing the record for ID 2 of type peerGuiSvr with the one-time password. An extended HA environment requires a DevSvr installation to insert the shadow_server ID 2 database record.

  2. Confirm that the guiDaemon.0 log correctly reports the local IP and peer IP from haSvr.cfg, as follows:
    read highAvail.heartbeatPort; myAddr =
    read highAvail.heartbeatPort; peerAddr =

7. Check if the role is primary server or secondary server

If only the primary server is running (no other server is running), the server should take the role of the primary with get BDB event DB_EVENT_REP_MASTER shown in the guiDaemon.0 log file.

In the case of secondary server status, guiDaemon.0 will show get BDB event DB_EVENT_REP_CLIENT, indicating it has received the role of a secondary server from the highAvail process.
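The role check can be scripted against the guiDaemon.0 log. The event strings are taken from this article; the sample log file below is an illustrative stand-in for GuiSvr/var/errorLog/guiDaemon.0:

```shell
# Hypothetical guiDaemon.0 excerpt (event strings from this article)
log=/tmp/guiDaemon.0.sample
echo 'get BDB event DB_EVENT_REP_CLIENT' > "$log"

# Determine which role the highAvail process assigned to this node
if grep -q 'DB_EVENT_REP_MASTER' "$log"; then
    echo "role: primary"
elif grep -q 'DB_EVENT_REP_CLIENT' "$log"; then
    echo "role: secondary"
else
    echo "role not yet assigned - check heartbeat (step 1)"
fi
```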

8. Verify secondary guiSvrManager startup and replication status

The get BDB event DB_EVENT_REP_NEWMASTER is shown on the secondary server when it has successfully connected to the primary server and database replication is starting. When replication is complete, get BDB event DB_EVENT_REP_STARTUPDONE is displayed. Note: For a first-time replication, or when the databases have diverged and need to be re-synchronized, this operation can take some time depending on the network speed and database size.
No progress is displayed in the guiDaemon.0 log file, but files in /var/netscreen/GuiSvr/xdb/data should be created and grow as containers are replicated from the primary server. Monitor with ls -ltr to sort by modification time.

Monitor the transaction log files in GuiSvr/xdb/log on both the primary and secondary servers. If the files are the same, the replication is most likely complete. Log into a GUI client; DB_EVENT_REP_STARTUPDONE should appear on the secondary server once the replication has completed.
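Comparing the log listings from both nodes can be sketched as below; the directory layout and file names are illustrative stand-ins for snapshots of GuiSvr/xdb/log taken on each server:

```shell
# Hypothetical snapshots of the GuiSvr/xdb/log listings from both nodes
mkdir -p /tmp/xdblog/primary /tmp/xdblog/secondary
touch /tmp/xdblog/primary/log.0000000001 /tmp/xdblog/secondary/log.0000000001

# Identical transaction-log file sets suggest replication has caught up
ls /tmp/xdblog/primary   > /tmp/xdblog/primary.list
ls /tmp/xdblog/secondary > /tmp/xdblog/secondary.list
if diff /tmp/xdblog/primary.list /tmp/xdblog/secondary.list >/dev/null; then
    echo "transaction logs match - replication likely complete"
else
    echo "transaction logs differ - replication in progress (or diverged)"
fi
```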

9. If the database cannot be replicated correctly or DB_PANIC is displayed in the secondary guiDaemon log

If the database is unable to sync back with the primary, an automatic re-sync process is usually started by the .haDoDirect script at startup. If the automatic re-sync does not start, a manual restart may be needed.

To restart the process:
  1. Stop the HaSvr process on the secondary server.

  2. Delete every file except DB_CONFIG in /var/netscreen/GuiSvr/xdb/data. Also delete the files in GuiSvr/xdb/init and GuiSvr/xdb/log, leaving the directories themselves in place but empty. This forces the server to re-sync a new database from the primary server during the next startup.

  3. Try to start the HaSvr process again and observe if replication is occurring.
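The cleanup can be rehearsed on a scratch copy of the xdb layout, as sketched below. The file names are illustrative; on the real secondary, stop HaSvr first and operate on /var/netscreen/GuiSvr/xdb:

```shell
# Hypothetical scratch copy of the xdb directory layout
xdb=/tmp/xdb
mkdir -p "$xdb/data" "$xdb/init" "$xdb/log"
touch "$xdb/data/DB_CONFIG" "$xdb/data/container.db" "$xdb/init/seed" "$xdb/log/log.01"

# Remove everything under data/ except DB_CONFIG; empty init/ and log/
find "$xdb/data" -type f ! -name DB_CONFIG -delete
find "$xdb/init" "$xdb/log" -type f -delete

ls -A "$xdb/data"   # only DB_CONFIG should remain
```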

10. Ensure proper failover usage is followed to avoid diverged databases

Note: With diverged databases, DB_PANIC is shown in GuiSvr/var/errorLog/guiDaemon.0 on the standby node.

Another server can be given the primary role only through a normal failover: with the HA process running on both servers, stop the HA services on the primary.

If services are started/stopped to let a node become a new primary server without an HA failover, the database will build a diverged log history from the other node. The new log will no longer be able to sync after the other node is started in standby mode.

11. Check for ourRsaPrivateKey message in guiDaemon.0 on the secondary server

If a message similar to ourRsaPrivateKey is missing in /usr/netscreen/GuiSvr/var/guiSvr.cfg! is repeated in /var/netscreen/GuiSvr/errorLog/guiDaemon.0 on the standby system, the clientOneTimePassword in guiSvr.cfg on the standby does not match the value stored in the database on the active NSM server.

1. Stop both NSM servers:
/etc/init.d/haSvr stop
2. Check the value in the NSM database on the Active server:
a. /usr/netscreen/GuiSvr/utils/

b. Open the database in read-only mode.

c. Select Option 7.

d. Enter: 0.shadow_server.2.

e. View and make note of the client one-time password in the shadow_server table. For example:
:type (peerGuiSvr)

:clientOneTimePassword (netscreen)
3. Change guiSvr.cfg on both systems to match what is shown in the NSM database.

4. Restart the NSM processes (Active system, then standby):
/etc/init.d/haSvr start
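Updating the password in guiSvr.cfg on both systems can be sketched as below. The exact key syntax in guiSvr.cfg is an assumption for illustration; "netscreen" stands in for whatever value step 2.e returned from the shadow_server record:

```shell
# Hypothetical clientOneTimePassword line (exact key syntax assumed)
cfg=/tmp/guiSvr.cfg.sample
echo 'guiSvr.clientOneTimePassword oldvalue' > "$cfg"

# Set the password to the value read from the shadow_server record (step 2.e)
sed -i 's/^\(guiSvr.clientOneTimePassword \).*/\1netscreen/' "$cfg"
grep clientOneTimePassword "$cfg"
```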
Modification History:
2021-03-23: Updated the article terminology to align with Juniper's Inclusion & Diversity initiatives.
2019-09-20: Fixed Step 11.2.c.
