[MX] Preventing chassis fabric degradation by triggering early fabric healing

Article ID: KB33743 KB Last Updated: 22 Jun 2021 Version: 2.0
Summary:

When fabric boards go down on MX Series routers, the corresponding fabric planes go down with them. If there is little traffic on the router's fabric, Fabric Healing (FH) starts only after the last plane goes offline. When that last fabric plane goes down, it creates a fabric blackhole in the system, which can impact services.

This article describes a configuration knob that starts Fabric Healing much earlier, at a configured fabric degradation percentage, which in turn protects the system from such fabric blackholes.

For more information about fabric blackholes, see Fabric Plane Management.

Symptoms:

When there is very little traffic on the fabric, Fabric Healing does not start until the last fabric plane goes offline. When the last fabric plane goes down, a fabric blackhole results, which can cause a service impact for end users.

Cause:

Fabric boards and planes can go down due to voltage or power issues, planes being taken offline by users, and other similar causes.

Solution:

Rather than waiting for the fabric blackhole to occur before any action is triggered, use the following configuration to trigger those actions as soon as the configured degradation threshold is reached. The actions are the same as the blackhole actions.

labroot@router-re0# show chassis fabric
degraded {
    action-on-non-blackhole-degradation 20;   <<<<< Here we used 20%, but you can use any required value.
}

After configuring the above knob, in this example we manually took fabric planes offline until the configured 20% value was reached. The Fabric Healing process started when the 20% threshold was reached, and the planes were seen to come back online as shown below.

In the following command output, you can see the current degradation percentage. The value decreases as the planes come up.

labroot@router-re0# run show chassis fabric degradation
                 Reqd/Curr  Configured          Current     Time Last
FPC    State     Planes     Degrad(%),action    Degrad(%)   Action Initiated
0      Online    21/15      n/a,none            28
1      Online    21/15      n/a,none            28
2      Online    21/15      n/a,none            28
3      Online    21/15      n/a,none            28
4      Empty 
5      Empty 
6      Empty 
7      Empty 
8      Empty 
9      Online    21/15      n/a,none            28
10     Online    21/15      n/a,none            28
11     Online    21/15      n/a,none            28
12     Online    21/15      n/a,none            28
13     Online    21/15      n/a,none            28
14     Empty 
15     Empty 
16     Empty 
17     Empty 
18     Empty 
19     Empty
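The degradation figures in this output can be reproduced with simple arithmetic. The following Python sketch parses the table and recomputes the percentage, assuming that Current Degrad(%) is the share of required planes that are missing, truncated to an integer. That formula is an assumption (the article does not state it), although it agrees with the 21/15 and 28 values shown above. The function names and the trimmed `SAMPLE` text are illustrative only.

```python
import re

# Trimmed copy of the "show chassis fabric degradation" output from this article.
SAMPLE = """\
FPC    State     Planes     Degrad(%),action    Degrad(%)   Action Initiated
0      Online    21/15      n/a,none            28
4      Empty
9      Online    21/15      n/a,none            28
"""

# Matches only rows for online FPCs: slot, required/current planes,
# configured "percent,action" column, and current degradation percent.
ROW = re.compile(r"^(\d+)\s+Online\s+(\d+)/(\d+)\s+(\S+)\s+(\d+)")


def parse_degradation(text):
    """Return one dict per online FPC row; header and Empty rows are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            fpc, reqd, curr, configured, degrad = match.groups()
            rows.append({
                "fpc": int(fpc),
                "required": int(reqd),
                "current": int(curr),
                "configured": configured,
                "degradation": int(degrad),
            })
    return rows


def estimated_degradation(required, current):
    """ASSUMPTION: percent of required planes missing, truncated to an integer."""
    return (required - current) * 100 // required


def crosses_threshold(required, current, threshold_percent):
    """True when the estimated degradation is at or above the configured value."""
    return estimated_degradation(required, current) >= threshold_percent


for row in parse_degradation(SAMPLE):
    estimate = estimated_degradation(row["required"], row["current"])
    print(row["fpc"], row["degradation"], estimate, crosses_threshold(
        row["required"], row["current"], 20))
```

Under this assumption, an FPC that needs 21 planes and sees 15 is above the 20% value configured in this example, which is consistent with healing having been triggered.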

After configuring the above knob, you can check fabric reachability by using the following command:

labroot@router-re0# run show chassis fabric reachability

Fabric reachability status: Fabric degradation detected, action in progress
        Detected on                         : 2021-04-08 20:58:07 PDT
        Reason                              : Fabric Degradation due to grant timeouts seen by DPCs, MPCs, or FPCs

Fabric reachability action:
    Fabric reachability action              : Plane action
    Current phase                           : Plane Restart Phase is in progress
    Action started                          : 2021-04-08 20:58:17 PDT
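For monitoring, this output can be turned into key/value pairs. The following Python sketch does that for the sample above. The helper names are hypothetical, and the check for the phrase "action in progress" is based only on the status line shown in this article, since other possible status strings are not listed here.

```python
import re

# Copy of the "show chassis fabric reachability" output from this article.
SAMPLE = """\
Fabric reachability status: Fabric degradation detected, action in progress
        Detected on                         : 2021-04-08 20:58:07 PDT
        Reason                              : Fabric Degradation due to grant timeouts seen by DPCs, MPCs, or FPCs

Fabric reachability action:
    Fabric reachability action              : Plane action
    Current phase                           : Plane Restart Phase is in progress
    Action started                          : 2021-04-08 20:58:17 PDT
"""

# "name : value" lines. The first colon followed by whitespace is the separator,
# so colons inside timestamps stay in the value. Bare headers ending in ":" do not match.
FIELD = re.compile(r"^\s*(.+?)\s*:\s+(\S.*)$")


def parse_reachability(text):
    """Return a dict of field name to value for every 'name : value' line."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def healing_in_progress(fields):
    """ASSUMPTION: the status text contains 'action in progress' while healing runs."""
    status = fields.get("Fabric reachability status", "")
    return "action in progress" in status.lower()


fields = parse_reachability(SAMPLE)
print(fields["Current phase"], healing_in_progress(fields))
```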
 

The following output shows log messages that are seen after the Fabric Healing process has started and the fabric planes start to come up.

Apr  8 20:58:07.393  router-re0 chassisd[7277]: CHASSISD_FM_FABRIC_DEGRADED: DPCs are seeing grant timeouts; System is blackholing   Need to attempt fabric healing.  Action will be taken after 10 seconds, to address the fabric down condition. 
Apr  8 20:58:17.402  router-re0 chassisd[7277]: CHASSISD_FM_ACTION_FPC_OFFLINE: FPC 0 offline initiated  to attempt healing of the fabric down condition. 
Apr  8 20:58:17.411  router-re0 chassisd[7277]: CHASSISD_FM_ACTION_FPC_OFFLINE: FPC 1 offline initiated  to attempt healing of the fabric down condition. 
Apr  8 20:58:17.420  router-re0 chassisd[7277]: CHASSISD_FM_ACTION_FPC_OFFLINE: FPC 2 offline initiated  to attempt healing of the fabric down condition. 
Apr  8 20:58:17.429  router-re0 chassisd[7277]: CHASSISD_FM_ACTION_FPC_OFFLINE: FPC 3 offline initiated  to attempt healing of the fabric down condition. 
Apr  8 20:58:17.437  router-re0 chassisd[7277]: CHASSISD_FM_ACTION_FPC_OFFLINE: FPC 9 offline initiated  to attempt healing of the fabric down condition. 
Apr  8 20:58:17.446  router-re0 chassisd[7277]: CHASSISD_FM_ACTION_FPC_OFFLINE: FPC 10 offline initiated  to attempt healing of the fabric down condition. 
Apr  8 20:58:17.482  router-re0 chassisd[7277]: CHASSISD_FM_ACTION_FPC_OFFLINE: FPC 11 offline initiated  to attempt healing of the fabric down condition. 
Apr  8 20:58:17.499  router-re0 chassisd[7277]: CHASSISD_FM_ACTION_FPC_OFFLINE: FPC 12 offline initiated  to attempt healing of the fabric down condition. 
Apr  8 20:58:17.522  router-re0 chassisd[7277]: CHASSISD_FM_ACTION_FPC_OFFLINE: FPC 13 offline initiated  to attempt healing of the fabric down condition.
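The 10-second delay announced in the first message can be checked against the timestamps of the messages that follow it. The following Python sketch parses a few of the log lines above and measures the gap between the degradation message and the first FPC offline action. The function names are hypothetical, and the `year` parameter is an assumption needed because these syslog lines do not carry a year (2021 is taken from the reachability output).

```python
import re
from datetime import datetime

# First three log lines from this article.
SAMPLE = """\
Apr  8 20:58:07.393  router-re0 chassisd[7277]: CHASSISD_FM_FABRIC_DEGRADED: DPCs are seeing grant timeouts; System is blackholing   Need to attempt fabric healing.  Action will be taken after 10 seconds, to address the fabric down condition.
Apr  8 20:58:17.402  router-re0 chassisd[7277]: CHASSISD_FM_ACTION_FPC_OFFLINE: FPC 0 offline initiated  to attempt healing of the fabric down condition.
Apr  8 20:58:17.411  router-re0 chassisd[7277]: CHASSISD_FM_ACTION_FPC_OFFLINE: FPC 1 offline initiated  to attempt healing of the fabric down condition.
"""

# timestamp, hostname, chassisd[pid]:, message tag, message text
LINE = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:.]+)\s+\S+\s+chassisd\[\d+\]:\s+(\w+):\s+(.*)$")
FPC = re.compile(r"FPC (\d+) offline initiated")


def parse_events(text, year=2021):
    """Return (datetime, tag, message) tuples for chassisd lines. `year` is assumed."""
    events = []
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        stamp, tag, message = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")
        events.append((when, tag, message))
    return events


def healing_summary(events):
    """Return (seconds from degradation message to first FPC offline, FPC slots)."""
    detected = [t for t, tag, _ in events if tag == "CHASSISD_FM_FABRIC_DEGRADED"]
    offline = []
    for when, tag, message in events:
        found = FPC.search(message)
        if tag == "CHASSISD_FM_ACTION_FPC_OFFLINE" and found:
            offline.append((when, int(found.group(1))))
    delay = None
    if detected and offline:
        delay = (offline[0][0] - detected[0]).total_seconds()
    return delay, [slot for _, slot in offline]


delay, slots = healing_summary(parse_events(SAMPLE))
print(delay, slots)
```

For these sample lines the measured gap is just over 10 seconds, in line with the delay stated in the CHASSISD_FM_FABRIC_DEGRADED message.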

Modification History:

2021-06-22: Updated the outputs for consistency and made other minor changes
