When fabric boards on MX Series routers go down, the corresponding fabric planes go down with them. If there is little traffic on the router's fabric, Fabric Healing (FH) starts only after the last plane goes offline. When the last fabric plane goes down, a fabric blackhole is created in the system, which can impact services.
This article describes a configuration knob that starts Fabric Healing much earlier, at a configured fabric degradation percentage, which in turn protects the system from such fabric blackholes.
For more information about fabric blackholes, see Fabric Plane Management.
When there is very little traffic on the fabric, Fabric Healing does not start until the last fabric plane goes offline. Once the last fabric plane is down, a fabric blackhole results, which can cause a service impact for end users.
Fabric boards and planes can go down because of voltage or power issues, planes being taken offline by users, and other similar causes.
Rather than waiting for the fabric blackhole to occur and for actions to be triggered, the following configuration can be used to trigger those actions when the configured degradation threshold is reached. The actions are the same as blackhole actions.
labroot@router-re0# show chassis fabric
degraded {
action-on-non-blackhole-degradation 20; <<<<< Here we used 20%, but you can use any required value.
}
After configuring the above knob, in this example, we manually took fabric planes offline until the degradation crossed the configured 20% threshold. The Fabric Healing process started once the threshold was reached, and the planes were seen coming back online, as shown below.
In the following command output, you can see the current degradation percentage, which decreases as the planes come back up.
labroot@router-re0# run show chassis fabric degradation
             Reqd/Curr  Configured         Current    Time Last
FPC  State   Planes     Degrad(%),action   Degrad(%)  Action Initiated
0    Online  21/15      n/a,none           28
1    Online  21/15      n/a,none           28
2    Online  21/15      n/a,none           28
3    Online  21/15      n/a,none           28
4    Empty
5    Empty
6    Empty
7    Empty
8    Empty
9    Online  21/15      n/a,none           28
10   Online  21/15      n/a,none           28
11   Online  21/15      n/a,none           28
12   Online  21/15      n/a,none           28
13   Online  21/15      n/a,none           28
14   Empty
15   Empty
16   Empty
17   Empty
18   Empty
19   Online  21/15      n/a,none           28
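As a rough illustration of how the Current Degrad(%) column relates to the Reqd/Curr Planes column, the following Python sketch computes the share of required planes that are missing. The formula and the "greater than or equal to" threshold comparison are assumptions inferred from the sample output (21 required, 15 current, 28% shown); this article does not state how Junos OS calculates the value. The function names are hypothetical.

```python
def degradation_percent(required_planes: int, current_planes: int) -> int:
    """Return the share of required fabric planes that are missing, in whole percent.

    Assumption: the value is truncated to an integer, which is consistent
    with 21 required / 15 current being reported as 28 in the sample output.
    """
    if required_planes <= 0:
        raise ValueError("required_planes must be positive")
    missing = max(required_planes - current_planes, 0)
    return missing * 100 // required_planes


def threshold_reached(required_planes: int, current_planes: int, threshold: int) -> bool:
    """Assumed check: degradation at or above the configured percentage."""
    return degradation_percent(required_planes, current_planes) >= threshold


print(degradation_percent(21, 15))    # 28
print(threshold_reached(21, 15, 20))  # True
```

With the values from the output above, 6 of 21 planes are missing, which gives 28% and exceeds the 20% configured in this example.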
After configuring the above knob, you can check fabric reachability by using the following command:
labroot@router-re0# run show chassis fabric reachability
Fabric reachability status: Fabric degradation detected, action in progress
Detected on : 2018-10-08 07:18:51 PDT
Reason : Fabric Degradation due to grant timeouts seen by DPCs, MPCs, or FPCs
Fabric reachability action:
Fabric reachability action : Plane action
Current phase : Plane Restart Phase is in progress
Action started : 2018-10-08 07:19:01 PDT
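The gap between the "Detected on" and "Action started" timestamps can be worked out with a short Python sketch. It is only an illustration: the helper name is hypothetical, and the time zone abbreviation is dropped on the assumption that both values are reported in the same zone.

```python
from datetime import datetime


def parse_cli_timestamp(text: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS TZ' value, ignoring the trailing zone name."""
    date_part, time_part = text.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")


detected = parse_cli_timestamp("2018-10-08 07:18:51 PDT")
started = parse_cli_timestamp("2018-10-08 07:19:01 PDT")

delay_seconds = int((started - detected).total_seconds())
print(delay_seconds)  # 10
```

The 10-second result lines up with the chassisd log message in this article, which states that action is taken 10 seconds after the degradation is detected.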
The following log messages are seen after the Fabric Healing process has started and the fabric planes begin to come up.
Oct 8 07:18:52.212 router-re0 chassisd[5988]: CHASSISD_FM_FABRIC_DEGRADED: DPCs are seeing grant timeouts; System is blackholing Need to attempt fabric healing. Action will be taken after 10 seconds, to address the fabric down condition.
Oct 8 07:19:02.242 router-re0 chassisd[5988]: CHASSISD_FM_ACTION_PLANE_ONLINE: Fabric plane 0 online initiated to attempt healing of the fabric down condition.
Oct 8 07:21:14.386 router-re0 chassisd[5988]: CHASSISD_FM_ACTION_PLANE_ONLINE: Fabric plane 1 online initiated to attempt healing of the fabric down condition.
Oct 8 07:21:21.725 router-re0 chassisd[5988]: CHASSISD_FM_ACTION_PLANE_ONLINE: Fabric plane 2 online initiated to attempt healing of the fabric down condition.
Oct 8 07:21:29.152 router-re0 chassisd[5988]: CHASSISD_FM_ACTION_PLANE_ONLINE: Fabric plane 3 online initiated to attempt healing of the fabric down condition.
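To follow the healing sequence in a longer log file, messages like the ones above can be filtered with a small script. The Python sketch below is a minimal example that assumes the log lines have exactly the layout shown in this article; the function name and the regular expression are illustrative and are not part of Junos OS.

```python
import re

# Matches the CHASSISD_FM_ACTION_PLANE_ONLINE lines as they appear in this article.
PLANE_ONLINE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\S+\s+chassisd\[\d+\]:\s+"
    r"CHASSISD_FM_ACTION_PLANE_ONLINE: Fabric plane (?P<plane>\d+) online initiated"
)


def plane_online_events(log_lines):
    """Return (plane number, time string) for every plane-online healing action."""
    events = []
    for line in log_lines:
        match = PLANE_ONLINE.search(line)
        if match:
            events.append((int(match.group("plane")), match.group("time")))
    return events


sample = [
    "Oct 8 07:18:52.212 router-re0 chassisd[5988]: CHASSISD_FM_FABRIC_DEGRADED: "
    "DPCs are seeing grant timeouts; System is blackholing Need to attempt fabric healing. "
    "Action will be taken after 10 seconds, to address the fabric down condition.",
    "Oct 8 07:19:02.242 router-re0 chassisd[5988]: CHASSISD_FM_ACTION_PLANE_ONLINE: "
    "Fabric plane 0 online initiated to attempt healing of the fabric down condition.",
    "Oct 8 07:21:14.386 router-re0 chassisd[5988]: CHASSISD_FM_ACTION_PLANE_ONLINE: "
    "Fabric plane 1 online initiated to attempt healing of the fabric down condition.",
]

for plane, time_text in plane_online_events(sample):
    print(f"plane {plane} online initiated at {time_text}")
```

The CHASSISD_FM_FABRIC_DEGRADED line is skipped because it does not match the plane-online pattern, so only the per-plane healing actions are listed.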