[QFX] 'Chassis Manager connection down' alarms observed and RE switchover fails with error 'RE host chassis manager connection unstable'

Article ID: KB35780 | KB Last Updated: 30 Mar 2021 | Version: 3.0
Summary:

After a Routing Engine (RE) switchover, a 'Chassis Manager connection down' alarm is raised for the old primary RE, and the attempt to fail back to it is rejected with the error: 'Command aborted, RE host chassis manager connection unstable'.

Symptoms:

An RE switchover is attempted from RE0 to RE1 and the switchover is successful. However, alarms are triggered on the new primary as shown below:

QFX10K8-RE1> show chassis alarms 
2 alarms currently active
Alarm time               Class  Description
2020-05-05 09:58:03 CDT  Major  Host 0 : Chassis Manager connection down
2020-05-05 09:56:33 CDT  Minor  Backup RE Active
 

A switchover back to RE0 then fails, even after multiple attempts, with the following error message:

[MASTER]
QFX10K8-RE1> request chassis routing-engine master switch
Toggle mastership between routing engines ? [yes,no] (no) yes
error: Command aborted, RE host chassis manager connection unstable

 

Both REs are running the same Junos release. Restarting RE0 alone, or restarting both REs, did not resolve the issue. Replacing the REs cleared the issue, but it recurred within a few days with the new REs in place.

Cause:

The error 'RE host chassis manager connection unstable' occurs because the lcmd process on the RE0 host fails to launch after the RE switchover. This can be confirmed by comparing the process list on both REs:

RE0 Host shell:

root@QFX10K8-RE0-node:~# ps aux | grep lcmd
root       465  0.0  0.0   4400   512 pts/2    S+   17:22   0:00 grep lcmd


RE1 Host shell:

root@QFX10K8-RE1-node:~# ps aux | grep lcmd
root     22615  0.0  0.0   4400   512 pts/2    S+   17:27   0:00 grep lcmd
root     32309  0.0  0.0  32088  7816 ?        Ssl  09:58   0:04 lcmd  <-- This is the process missing on RE0.
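
The comparison above can also be scripted. The following is a minimal Python sketch, not a Juniper tool: the function name process_running and the sample strings are illustrative assumptions, and the sample data is copied from the host-shell output in this article. It reports whether lcmd appears as a real process in `ps aux` output while ignoring the `grep lcmd` line that the search itself produces.

```python
def process_running(ps_output: str, name: str = "lcmd") -> bool:
    """Return True if `ps aux` output lists a process whose executable is `name`."""
    for line in ps_output.splitlines():
        # `ps aux` prints 11 columns; the 11th (COMMAND) may itself contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] == "USER":
            continue  # skip the header row, echoed commands, and blank lines
        argv = fields[10].split()
        # Compare the executable's base name so that "grep lcmd" does not match.
        if argv and argv[0].rsplit("/", 1)[-1] == name:
            return True
    return False


# Sample data copied from the host-shell output in this article.
RE0_PS = """\
root       465  0.0  0.0   4400   512 pts/2    S+   17:22   0:00 grep lcmd
"""
RE1_PS = """\
root     22615  0.0  0.0   4400   512 pts/2    S+   17:27   0:00 grep lcmd
root     32309  0.0  0.0  32088  7816 ?        Ssl  09:58   0:04 lcmd
"""

if __name__ == "__main__":
    print("RE0 lcmd running:", process_running(RE0_PS))
    print("RE1 lcmd running:", process_running(RE1_PS))
```

Matching on the executable name rather than on the whole line avoids the false positive that a plain `grep lcmd` produces by matching its own command line.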


Host logs on RE0 indicate that lcmd keeps trying to restart but fails on every attempt.

root@QFX10K8-RE0-node:/var/log# zmore syslog*
2020-05-05T08:35:46.843553-05:00 QFX10K8-RE0-node monit[7561]: 'lcmd' failed to start
2020-05-05T08:35:48.849291-05:00 QFX10K8-RE0-node monit[7561]: 'lcmd' service restarted 4 times within 4 cycles(s) - exec
2020-05-05T08:35:48.849302-05:00 QFX10K8-RE0-node monit[7561]: 'lcmd' exec: /usr/sbin/monit_daemon_fail.sh
2020-05-05T08:35:48.849378-05:00 QFX10K8-RE0-node monit[7561]: 'lcmd' process is not running
2020-05-05T08:35:48.849284-05:00 QFX10K8-RE0-node monit[7561]: 'lcmd' trying to restart
2020-05-05T08:35:48.849312-05:00 QFX10K8-RE0-node monit[7561]: 'lcmd' start: /sbin/service
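
To gauge how often lcmd is failing, the monit entries can be tallied. The sketch below is a hedged illustration only: the regular expression is inferred from the six log lines shown above and is not a documented monit log format, and the function and counter names are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines in this article (an assumption).
MONIT_LINE = re.compile(r"monit\[\d+\]: '(?P<service>[^']+)' (?P<message>.+)$")

# Message prefixes of interest, mapped to illustrative counter names.
EVENTS = {
    "failed to start": "failed_to_start",
    "trying to restart": "restart_attempts",
    "process is not running": "not_running",
}


def summarize_monit(log_text: str, service: str = "lcmd") -> Counter:
    """Count selected monit events for one service in syslog text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = MONIT_LINE.search(line)
        if match is None or match.group("service") != service:
            continue
        for prefix, key in EVENTS.items():
            if match.group("message").startswith(prefix):
                counts[key] += 1
                break
    return counts


# Sample data copied from the RE0 host log excerpt in this article.
SAMPLE_LOG = """\
2020-05-05T08:35:46.843553-05:00 QFX10K8-RE0-node monit[7561]: 'lcmd' failed to start
2020-05-05T08:35:48.849291-05:00 QFX10K8-RE0-node monit[7561]: 'lcmd' service restarted 4 times within 4 cycles(s) - exec
2020-05-05T08:35:48.849302-05:00 QFX10K8-RE0-node monit[7561]: 'lcmd' exec: /usr/sbin/monit_daemon_fail.sh
2020-05-05T08:35:48.849378-05:00 QFX10K8-RE0-node monit[7561]: 'lcmd' process is not running
2020-05-05T08:35:48.849284-05:00 QFX10K8-RE0-node monit[7561]: 'lcmd' trying to restart
2020-05-05T08:35:48.849312-05:00 QFX10K8-RE0-node monit[7561]: 'lcmd' start: /sbin/service
"""

if __name__ == "__main__":
    for event, count in sorted(summarize_monit(SAMPLE_LOG).items()):
        print(f"{event}: {count}")
```

Counts that keep growing across successive log captures would be consistent with the restart loop described here.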


The lcmd process fails to launch because certain files on the RE0 host fail an integrity check after the RE switchover. One scenario in which this is observed is when a different type of FPC is inserted into a slot that was previously used by another type of FPC, and an RE switchover is then performed.

Solution:

The underlying issue is that the lcmd process fails to start and keeps trying to restart. As a workaround, recover by performing a hypervisor reboot during a maintenance window. However, the issue is expected to recur if a similar sequence of steps is repeated; that is, a different type of FPC is moved into a slot that was previously used by another type of FPC, and an RE switchover is then performed.

This issue is resolved in the following Junos releases and in later releases:

  • junos:17.3R3-S6
  • junos:17.4R2-S9
  • junos:17.4R3
  • junos:18.1R3-S5
  • junos:18.2R1
  • junos:18.3R1
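
The list above can be turned into a rough check for whether a given release string contains the fix. This is a simplified Python sketch under stated assumptions, not Juniper's release logic: it assumes that releases within one train (for example 17.4) are ordered by R number and then by S number, that any train newer than 18.3 contains the fix, and that trains older than 17.3 do not. The names parse_release and has_fix are hypothetical, and special release formats are not handled.

```python
import re

# Fixed releases as listed in this article (the "junos:" prefix is dropped).
FIXED_RELEASES = ["17.3R3-S6", "17.4R2-S9", "17.4R3", "18.1R3-S5", "18.2R1", "18.3R1"]

_RELEASE = re.compile(r"(\d+)\.(\d+)R(\d+)(?:-S(\d+))?")


def parse_release(version: str):
    """Split e.g. '17.4R2-S9' into ((17, 4), (2, 9)); a missing -S counts as 0."""
    match = _RELEASE.match(version.split(":", 1)[-1].strip())
    if match is None:
        raise ValueError(f"unrecognized release string: {version!r}")
    major, minor, r_num, s_num = match.groups()
    return (int(major), int(minor)), (int(r_num), int(s_num or 0))


def has_fix(version: str) -> bool:
    """Best-effort guess at whether `version` includes the fix (see assumptions)."""
    train, build = parse_release(version)
    fixed = [parse_release(v) for v in FIXED_RELEASES]
    same_train = [b for t, b in fixed if t == train]
    if same_train:
        # Assumption: within a train, anything at or after the earliest
        # listed fixed release contains the fix.
        return build >= min(same_train)
    # Assumption: trains newer than the newest listed train contain the fix.
    return train > max(t for t, _ in fixed)


if __name__ == "__main__":
    for candidate in ["17.4R2-S8", "17.4R2-S9", "18.1R3-S4", "18.2R1", "19.1R1"]:
        print(candidate, has_fix(candidate))
```

Treat the result as a hint only and confirm against the release notes for the exact build before planning an upgrade.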
Modification History:
2021-03-25: Updated the article terminology to align with Juniper's Inclusion & Diversity initiatives
2020-07-22: Article reviewed for accuracy. Minor, non-technical edits.
 