
[MX] Destination timeouts to one FPC may cause all FPCs to go offline


Article ID: KB36457 | Last Updated: 05 Feb 2021 | Version: 1.0
Summary:

Destination timeouts to one FPC can cause all FPCs to go offline on the MX2020 chassis.

Under specific MPC hardware failure conditions on the MX2K platform, fabric healing attempts to auto-heal the fault location in three phases to prevent traffic blackholing. If, under such fault conditions, only destination timeouts are reported without corresponding link errors, the fabric-healing process might restart all MPCs in phase 2 as an auto-healing attempt; if the error condition appears again within 10 minutes, the final phase (phase 3) might take all MPCs in the system offline.
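The three-phase escalation described above can be sketched as a small state machine. This is an illustrative model only, not Juniper's implementation: the class and method names are invented, and the trigger handling is deliberately simplified. Only the phase order, the 10-minute recurrence window, and the "offline all FPCs" outcome are taken from this article.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Conceptual sketch of the three-phase fabric-healing escalation (illustrative,
# not Juniper code). Phase order, the 10-minute window, and the offline-all
# outcome come from this article; everything else is assumed.

REARM_WINDOW_SECS = 10 * 60  # recurrence window after phase 2, per the article

@dataclass
class FabricHealer:
    phase: int = 0
    phase2_time: Optional[float] = None
    offlined: List[int] = field(default_factory=list)

    def on_destination_timeout(self, now: float, suspect_fpcs: List[int]) -> str:
        """Return the healing action taken for a timeout event at time `now`."""
        if self.phase == 0:
            self.phase = 1
            return "phase1: offline/online fabric planes one by one"
        if self.phase == 1:
            self.phase = 2
            self.phase2_time = now
            return "phase2: offline/online suspect MPCs"
        if now - self.phase2_time <= REARM_WINDOW_SECS:
            # Fault recurred within 10 minutes of phase 2: escalate to phase 3
            # and offline every FPC that could not be exonerated.
            self.phase = 3
            self.offlined = list(suspect_fpcs)
            return "phase3: permanently offline FPCs " + ", ".join(map(str, suspect_fpcs))
        # Fault outside the window: treat it as a fresh event.
        self.phase = 1
        return "phase1: offline/online fabric planes one by one"
```

In this simplified model, a timeout that recurs shortly after the phase-2 MPC restarts is exactly what drives the escalation to phase 3 and the mass offline described in this article.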

Symptoms:
  • Traffic blackholing or degraded capacity seen for the device.
  • Fabric destination timeouts reported.
  • Multiple fabric planes go into check state simultaneously.
Cause:

Whenever fabric timeouts are detected for a destination PFE, the chassis's default recovery mechanism is automatic fabric healing: it takes the faulty component offline to prevent traffic blackholing.
This is beneficial because detection and recovery happen automatically, without manual intervention, and it prevents many outages.
However, in certain cases, if chassisd does not have enough evidence to identify the faulty component, it can take the extreme action of offlining all the FPCs on the chassis.
To understand the reason behind this extreme measure, we need to understand the automatic fabric-healing mechanism.

Before healing kicks in, the system needs to notice the degradation. This happens when the system detects fabric timeouts to one or multiple FPCs. The following log signatures are seen:

fpc10 CMTFPC: Fabric request time out pfe 0 plane 8 dest 142, attempting recovery

Fabric requests are sent from the source PFE and forwarded by the fabric plane to the destination PFE to determine whether it is ready to accept data. In response to these requests, grants are sent by the destination PFE and forwarded by the fabric plane back to the source PFE, after which data is finally sent from the source PFE to the destination PFE.

However, the source PFE cannot wait forever for the grants; when it gives up waiting, these timeouts are reported. The problem can lie either with the fabric plane or with the destination PFE, so the system needs to check both.

When many of these timeouts are seen from multiple source PFEs toward a single destination PFE, the system can identify the bad FPC. However, if the timeouts occur toward multiple destinations, the chassis cannot decide which component is bad.
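The isolation decision described above can be sketched as follows. This is an illustrative model, not Juniper source code; the function name and the (source, destination) tuple format are assumptions made for this sketch, mirroring the "Cannot decide" behavior shown in the logs later in this article.

```python
# Illustrative sketch of the fault-isolation heuristic (not Juniper code):
# each timeout report is a (source_fpc, destination_fpc) pair.

def find_bad_fpc(timeouts):
    """Return the destination FPC if exactly one FPC is implicated by the
    timeout reports; return None when more than one destination is implicated,
    i.e. when the chassis cannot decide on the bad component."""
    destinations = {dst for _src, dst in timeouts}
    if len(destinations) == 1:
        # Many source PFEs timing out toward a single destination FPC:
        # that FPC is the likely culprit.
        return destinations.pop()
    # Timeouts toward multiple destination FPCs: ambiguous, cannot decide.
    return None
```

For example, reports from FPCs 1, 3, and 7 all pointing at FPC 19 would implicate FPC 19, whereas reports pointing at both FPC 13 and FPC 19 would be undecidable.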

In the event of too many request timeouts, chassis fabric degradation is seen: 

chassisd[14234]: CHASSISD_FM_FABRIC_DEGRADED: DPCs are seeing grant timeouts; System is blackholing   Need to attempt fabric healing.  Action will be taken after 10 seconds, to address the fabric down condition.

Multiple fabric planes can go into check state as well: 

craftd[12005]:  Minor alarm set, Check plane 0 Fabric Chip
craftd[12005]:  Minor alarm set, Check plane 1 Fabric Chip
craftd[12005]:  Minor alarm set, Check plane 2 Fabric Chip
craftd[12005]:  Minor alarm set, Check plane 3 Fabric Chip
craftd[12005]:  Minor alarm set, Check plane 4 Fabric Chip
craftd[12005]:  Minor alarm set, Check plane 5 Fabric Chip
craftd[12005]:  Minor alarm set, Check plane 6 Fabric Chip
craftd[12005]:  Minor alarm set, Check plane 7 Fabric Chip

These alarms can also be seen using the CLI command "show chassis alarms".
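For reference, the alarm and fabric state can be inspected with standard Junos operational commands (exact output varies by platform and release):

```
show chassis alarms
show chassis fabric summary
show chassis fabric fpcs
```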

In the 1st phase of healing, the fabric planes are offlined and onlined one by one: 

Sample message logs: 

FH: fabric_mx_healing_phase1_stage: Starting Fabric heal phase 1 - Plane restart
FH: fabric_mx_healing_phase_plane_restart_stage: Starting Fabric heal phase 1 - Plane restart
FH: fabric_mx_healing_phase_end: End of Fabric heal phase 1

Chassisd logs: 

chassisd[11999]: CHASSISD_FM_ACTION_PLANE_ONLINE: Fabric plane 1 online initiated  to attempt healing of the fabric down condition.
chassisd[11999]: CHASSISD_FM_ACTION_PLANE_ONLINE: Fabric plane 2 online initiated  to attempt healing of the fabric down condition.
chassisd[11999]: CHASSISD_FM_ACTION_PLANE_OFFLINE: Fabric plane 3 offline initiated  to attempt healing of the fabric down condition.
chassisd[11999]: CHASSISD_FM_ACTION_PLANE_OFFLINE: Fabric plane 4 offline initiated  to attempt healing of the fabric down condition.
chassisd[11999]: CHASSISD_FM_ACTION_PLANE_OFFLINE: Fabric plane 5 offline initiated  to attempt healing of the fabric down condition.
chassisd[11999]: CHASSISD_FM_ACTION_PLANE_ONLINE: Fabric plane 3 online initiated  to attempt healing of the fabric down condition.

In the 2nd phase, the bad FPC is offlined and onlined. However, if destination timeouts are seen for more than one FPC, the system cannot decide which FPC is bad: 

Message logs:

FH: fabric_mx_healing_phase2_stage: Starting Fabric heal phase 2
FH: fabric_mx_healing_phase2_fpc_offline_stage: Starting Fabric heal phase 2 - Offline DPCs
FH: fabric_mx_healing_phase2_fpc_online_stage: Starting Fabric heal phase 2 - Online DPCs
FH: fabric_mx_healing_phase_end: End of Fabric heal phase 2

Chassisd logs:

chassisd[11999]: CHASSISD_FM_ACTION_FPC_OFFLINE: FPC 19 offline initiated  to attempt healing of the fabric down condition.
chassisd[11999]: CHASSISD_FM_ACTION_FPC_ONLINE: FPC 19 online initiated  to attempt healing of the fabric down condition.
chassisd[11999]: CHASSISD_FM_FABRIC_DEGRADED: DPCs are seeing grant timeouts; System is blackholing   Need to attempt fabric healing.  Action will be taken after 10 seconds, to address the fabric down condition.

Destination timeouts are seen for more than one FPC: 

Jan  1 05:07:19  FH: fm_hsl2_get_bad_dpc: fpc 1 is marked faulty
Jan  1 05:07:19  send: red alarm set, device FPC 1, reason FPC 1 has unreachable destinations
Jan  1 05:07:19  FH: fm_hsl2_get_bad_dpc: More than one fpc is marked faulty. Cannot decide
Jan  1 05:07:19  send: red alarm set, device FPC 13, reason FPC 13 has unreachable destinations
Jan  1 05:07:19  FH: fm_hsl2_get_bad_dpc: More than one fpc is marked faulty. Cannot decide
Jan  1 05:07:19  send: red alarm set, device FPC 19, reason FPC 19 has unreachable destinations
Jan  1 05:07:19 CHASSISD_FM_FABRIC_DEGRADED: DPCs are seeing grant timeouts; System is blackholing Need to attempt fabric healing. Action will be taken after 10 seconds, to address the fabric down condition.

Start of phase 3: permanent offline of the FPCs:

Jan  8 05:53:33  FH: fabric_mx_healing_phase3_stage: Starting Fabric heal phase 3 - DPC Offline
Jan  8 05:53:33  FH: fabric_mx_healing_phase_end: End of Fabric heal phase 3

More than one FPC is offlined because the system could not narrow down the bad FPC: 

chassisd[11999]: CHASSISD_FM_ACTION_FPC_OFFLINE: FPC 11 offline initiated  to attempt healing of the fabric down condition.
chassisd[11999]: CHASSISD_FM_ACTION_FPC_OFFLINE: FPC 14 offline initiated  to attempt healing of the fabric down condition.

To bring the FPCs back online after phase 3, manual intervention is required.
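For example, assuming FPC 11 was one of the slots offlined in phase 3 (as in the logs above), it can be brought back with the standard Junos operational command below; replace the slot number with the affected slot, and verify the hardware state first, since onlining a genuinely faulty MPC can re-trigger healing:

```
request chassis fpc slot 11 online
```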

In some cases, the healing procedure can get stuck in phase 2 (restart of FPCs). If the restarted FPC has a hardware failure, the system tries to bring it online, but the FPC remains unresponsive. In this case, the FPC with the hardware issue inadvertently prevents chassisd from moving forward to phase 3 of fabric healing, and thereby prevents the other, working FPCs from being taken offline.

Sample logs in case where FPC was unresponsive during phase 2 of fabric healing:

Jan  1 04:51:15 CHASSISD_FRU_UNRESPONSIVE_RETRY: Attempt 3 to power on FPC 10 timed out; restarted it
Jan  1 04:51:15 fpc_tiny_take_offline: FPC 10 state=2
Jan  1 04:51:15 CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 10 offline: Restarting unresponsive board
Jan  1 04:51:15  fpc_down slot 0 reason Restarting unresponsive board cargs 0x0
Jan  1 04:51:15  fpc_offline_now - slot 10, reason: Restarting unresponsive board, error OK transition state 1
Jan  1 04:51:15 CHASSISD_SNMP_TRAP3: ENTITY trap generated: entStateOperDisabled (entPhysicalIndex 24, entStateAdmin 4, entStateAlarm 0)
Jan  1 04:51:15 CHASSISD_SNMP_TRAP0: ENTITY trap generated: entConfigChanged
Jan  1 04:51:15  fru_power_off_generic
Jan  1 04:51:15  fru_power_off_generic: calling fru_poweroff vector
Jan  1 04:51:15  FPC#10 - power off [addr 0xe] reason: Restarting unresponsive board
 
Jan  3 05:31:39  fpc_offline_now - slot 10, is_resync_ready cleared
Jan  3 05:31:39  hwdb: entry for fpc 3139 at slot 10 deleted
Jan  3 05:31:39  send: red alarm set, device FPC 0, reason FPC 10 Hard errors
Solution:
This issue is being tracked through PR1482124 - Fabric healing logic incorrectly makes all MPC line cards go offline in the MX2000 router while the hardware fault is located on one specific MPC line card slot.

In the long term, the Junos OS version on the device needs to be upgraded to one of the fixed releases. As a short-term workaround, phase 3 of the fabric-healing procedure can be disabled using the following CLI command: 
set chassis fpc  fabric blackhole-action offline

 