[MX] FPC health check after fabric degradation due to grant timeouts on DPCs

Article ID: KB35847   KB Last Updated: 29 May 2020   Version: 1.0
Summary:

Juniper routers and high-end switches have built-in resiliency to handle failures and error conditions encountered during normal operation. Junos software takes immediate action to remedy failure conditions and minimize traffic loss; no manual intervention is needed. Fabric degradation is one possible cause of such error conditions. When the system detects unreachable Packet Forwarding Engine (PFE) destinations, it attempts to restore fabric connectivity. If restoration fails, the system turns off the interfaces to trigger local protection action or traffic re-route on the adjacent routers.

The recovery process to re-establish fabric connectivity consists of the following phases:

  1. Fabric plane restart phase.
  2. Fabric plane and line card restart phase.
  3. Line card offline phase.
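
The escalation through these phases can be pictured with a minimal Python sketch. It assumes that the phases are attempted in the listed order and that escalation stops as soon as a phase restores fabric connectivity. The healed_after callback and every name below are hypothetical illustrations, not Junos internals.

```python
from enum import Enum
from typing import Callable


class Phase(Enum):
    PLANE_RESTART = "Fabric plane restart phase"
    PLANE_AND_FPC_RESTART = "Fabric plane and line card restart phase"
    FPC_OFFLINE = "Line card offline phase"


def run_recovery(healed_after: Callable[[Phase], bool]) -> Phase:
    """Return the phase at which the modelled recovery stops.

    Assumption: phases run in the listed order, and the line card is taken
    offline only if neither restart phase restores fabric connectivity.
    """
    for phase in (Phase.PLANE_RESTART, Phase.PLANE_AND_FPC_RESTART):
        if healed_after(phase):
            return phase
    return Phase.FPC_OFFLINE


if __name__ == "__main__":
    # Hypothetical case: healing succeeds only once the line card is restarted too.
    result = run_recovery(lambda p: p is Phase.PLANE_AND_FPC_RESTART)
    print(result.value)
```

In this model, reaching the third phase means that both restart attempts failed, which corresponds to the case where the interfaces are turned off so that adjacent routers re-route traffic.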

This article describes the health checks to perform after fabric healing has occurred on a DPC in MX Series devices.

Symptoms:
  1. The FPC restarts (reboots) on its own.
  2. The FPC goes offline and does not come back up.

The following error messages can be seen on the device:

Feb 20 05:29:03   fpc11 CMTFPC: Fabric request time out pfe 3 plane 1 dest 117, attempting recovery
Feb 20 05:29:03   fpc8 CMTFPC: Fabric request time out pfe 1 plane 1 dest 116, attempting recovery
Feb 20 05:29:03   chassisd[5827]: CHASSISD_FM_FABRIC_DEGRADED: DPCs are seeing grant timeouts; System is blackholing   Need to attempt fabric healing.  Action will be  taken after 10 seconds, to address the fabric down condition.
Feb 20 05:29:03   alarmd[5923]: Alarm set: FPC color=RED, class=CHASSIS, reason=FPC 9 has unreachable destinations
Feb 20 05:29:03   craftd[5830]:  Major alarm set, FPC 9 has unreachable destinations
<snip>
Feb 20 15:29:08   chassisd[5827]: CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on (jnxFruContentsIndex 23, jnxFruL1Index 10, jnxFruL2Index 0, jnxFruL3Index 0,  jnxFruName ADC 9, jnxFruType 20, jnxFruSlot 9, jnxFruOfflineReason 2, jnxFruLastPowerOff 0, jnxFruLastPowerOn 141463817)
Feb 20 15:29:08   kernel: GENCFG: op 32 (Resync blob) failed; err 7 (Doesn't Exist)
Feb 20 15:29:08   last message repeated 8 times
Feb 20 15:29:08   chassisd[5827]: CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on (jnxFruContentsIndex 7, jnxFruL1Index 10, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName  FPC: MPC5E 3D 24XGE+6XLGE @ 9/*/*, jnxFruType 3, jnxFruSlot 9, jnxFruOfflineReason 2, jnxFruLastPowerOff 141461675, jnxFruLastPowerOn 141463892)
Feb 20 15:29:11  clr1-a-gdc.bf1 chassisd[5827]: CHASSISD_FM_ACTION_FPC_ONLINE: FPC 9 online initiated  to attempt healing of the fabric down condition. 
<snip>
Feb 20 15:31:42   alarmd[5923]: Alarm cleared: FPC color=RED, class=CHASSIS, reason=FPC 9 offlined due to unreachable destinations
Feb 20 15:31:42   craftd[5830]: Major alarm cleared, FPC 9 offlined due to unreachable destinations
Feb 20 15:31:42   alarmd[5923]: Alarm cleared: FPC color=RED, class=CHASSIS, reason=FPC 9 has unreachable destinations
Feb 20 15:31:42   alarmd[5923]: Alarm cleared: FPC color=RED, class=CHASSIS, reason=FPC 9 Major Errors
Feb 20 15:31:42   chassisd[5827]: Use of MPC in IP services mode requires L3 bundle or license
Feb 20 15:31:42   craftd[5830]: Major alarm cleared, FPC 9 has unreachable destinations
Feb 20 15:31:42   craftd[5830]: Major alarm cleared, FPC 9 Major Errors
<snip>
Feb 20 16:08:32   chassisd[5827]: fru_nmi_timer: Restart FPC 9 due to NMI timeout
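
When a saved log file is large, a short script can pull out the relevant events. The following Python sketch is one possible approach, assuming the log has been exported to text and that the lines follow the format shown above; the function name scan_fabric_log is hypothetical.

```python
import re

# Pattern for the CMTFPC line format shown in the log excerpt above.
TIMEOUT_RE = re.compile(
    r"(?P<fpc>fpc\d+) CMTFPC: Fabric request time out "
    r"pfe (?P<pfe>\d+) plane (?P<plane>\d+) dest (?P<dest>\d+)"
)


def scan_fabric_log(lines):
    """Collect grant-timeout events and note whether degradation was declared."""
    timeouts = []
    degraded = False
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts.append(
                {
                    "fpc": match["fpc"],
                    "pfe": int(match["pfe"]),
                    "plane": int(match["plane"]),
                    "dest": int(match["dest"]),
                }
            )
        if "CHASSISD_FM_FABRIC_DEGRADED" in line:
            degraded = True
    return timeouts, degraded


if __name__ == "__main__":
    sample = [
        "Feb 20 05:29:03   fpc11 CMTFPC: Fabric request time out "
        "pfe 3 plane 1 dest 117, attempting recovery",
        "Feb 20 05:29:03   fpc8 CMTFPC: Fabric request time out "
        "pfe 1 plane 1 dest 116, attempting recovery",
        "Feb 20 05:29:03   chassisd[5827]: CHASSISD_FM_FABRIC_DEGRADED: "
        "DPCs are seeing grant timeouts; System is blackholing",
    ]
    events, degraded = scan_fabric_log(sample)
    print(events)
    print("degradation declared:", degraded)
```

Grouping the extracted events by plane or by destination can help show whether the timeouts point at one fabric plane or at one line card.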
Cause:
  1. An issue with Fabric connectivity from DPC.
  2. Fabric drops.
  3. Fabric errors triggered on a fabric plane.
  4. Unreachable destinations from FPC.
  5. Hardware issue with FPC.
Solution:


Check if there are active alarms on the device:

user@host> show chassis alarms

No alarms currently active  <-- Verifies that FPC alarm has cleared


Check if there are any core dumps on the device:

user@host> show system core-dumps no-forwarding


Check for CRC errors on the device:

user@host> request pfe execute command "show hsl2 statistics" target fpc 9


Check if FPC is Online:

If the FPC is online, it has recovered through the healing process that the software triggered internally. Monitor the FPC health for 24-48 hours to confirm that the problem does not recur.

user@host> show chassis fpc detail 9

 Slot 9 information:
  State                          Online
  Temperature                    30
  Total CPU DRAM                 3584 MB
  Total XR2                      259 MB
  Total DDR DRAM                 24960 MB
  Start time                     2020-02-20 16:09:26 UTC
  Uptime                         2 minutes, 38 seconds
  Max power consumption          496 Watts


Check Fabric Reachability: 

A request-grant mechanism implements flow control between the PFE and the fabric. A PFE that wants to send a packet to a destination PFE first sends a request through the fabric; it may send the packet only after the request is granted.

user@host> show chassis fabric reachability
Fabric reachability status: Fabric degradation condition healed  <-- If the FPC has recovered, verify that fabric reachability has also been restored
Detected on                 : 2020-05-05 11:48:15 UTC
Reason                      : Fabric Degradation due to grant timeouts seen by DPCs  <-- No grant was received to send packets to the destination PFE, so fabric degradation is suspected
Fabric reachability action:
Fabric reachability action   : Plane and FPC action
Current phase                : Plane and FPC Restart Phase is completed
Action started               : 2020-05-05 11:48:25 UTC
Action completed             : 2020-05-05 11:53:27 UTC

Fabric reachability resolution: Fabric degradation healed after phase Plane and FPC restart  <-- Confirms the reason for the FPC restart and that the fabric reachability issue is resolved
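
If this check is repeated across many devices, the status line can be tested programmatically. The following Python sketch assumes the command output has already been captured as text (how it is collected is outside the sketch) and that the field labels match the output above; the function names are hypothetical.

```python
def reachability_summary(output: str) -> dict:
    """Extract "label : value" fields from pasted reachability output."""
    fields = {}
    for raw in output.splitlines():
        # Drop any "<--" annotation such as those used in this article.
        line = raw.split("<--")[0].strip()
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def is_healed(output: str) -> bool:
    """True when the status line reports that the degradation has healed."""
    status = reachability_summary(output).get("Fabric reachability status", "")
    return "healed" in status.lower()


if __name__ == "__main__":
    sample = """\
Fabric reachability status: Fabric degradation condition healed
Detected on                 : 2020-05-05 11:48:15 UTC
Reason                      : Fabric Degradation due to grant timeouts seen by DPCs
Fabric reachability action   : Plane and FPC action
Current phase                : Plane and FPC Restart Phase is completed
"""
    print(is_healed(sample))
    print(reachability_summary(sample)["Current phase"])
```

The sketch only matches the word "healed" in the status line; any other status text is treated as not healed and should be reviewed manually.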


As a preventive measure, you can configure a fabric healing action that triggers when a configured threshold is reached. Refer to KB33743 - Preventing chassis fabric degradation by triggering early fabric healing.


Check Fabric Drop Statistics: 

The following command shows the total statistics for traffic destined to each FPC:

user@host> show class-of-service fabric statistics summary
Destination FPC Index: 0, Source FPC Index: Summarized from all source FPC's
Total statistics:   High priority           Low priority
    Packets:             448626385             1365903575
    Bytes  :          598687578257          1861601724122
    Pps    :                    40                   3996
    Bps    :                 20912                5470976
 Tx statistics:      High priority           Low priority
    Packets:             448626385             1365903575
    Bytes  :          598687578257          1861601724122
    Pps    :                    40                   3996
    Bps    :                 20912                5470976

 Drop statistics:    High priority           Low priority
    Pps    :                     0                      0  <-- Verify that the drop statistics are zero after healing
    Bps    :                     0                      0
Destination FPC Index: 1, Source FPC Index: Summarized from all source FPC's
...
Drop statistics:    High priority           Low priority
    Pps    :                     0                      0
    Bps    :                     0                      0
Destination FPC Index: 2, Source FPC Index: Summarized from all source FPC's
...
Drop statistics:    High priority           Low priority
    Pps    :                     0                      0
    Bps    :                     0                      0
Destination FPC Index: 3, Source FPC Index: Summarized from all source FPC's
....
Drop statistics:    High priority           Low priority
    Pps    :                     0                      0
    Bps    :                     0                      0
<snip>


The following command shows the statistics from a specific source FPC to a specific destination FPC (the output below is for source FPC 3 and destination FPC 0):

user@host> show class-of-service fabric statistics destination 0 source 3 
Destination FPC Index: 0, Source FPC Index: 3
 Total statistics:   High priority           Low priority
    Packets:              42445974              439161456
    Bytes  :           71944097695           604147953760
    Pps    :                     0                   3995
    Bps    :                     0                5470681
 Tx statistics:      High priority           Low priority
    Packets:              42445974              439161456
    Bytes  :           71944097695           604147953760
    Pps    :                     0                   3995
    Bps    :                     0                5470681
 Drop statistics:    High priority           Low priority  <-- Verify drop count
    Packets:                     0                      0
    Bytes  :                     0                      0
    Pps    :                     0                      0
    Bps    :                     0                      0
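
Checking every destination by eye is tedious on a fully populated chassis. As a rough aid, the Python sketch below scans captured output of the commands above and reports only the destinations whose drop rows are nonzero. It assumes the layout shown in this article; the function name nonzero_fabric_drops is hypothetical.

```python
import re

HEADER_RE = re.compile(r"Destination FPC Index:\s*(\d+)")
ROW_RE = re.compile(r"^\s*(Packets|Bytes|Pps|Bps)\s*:\s*(\d+)\s+(\d+)")


def nonzero_fabric_drops(output: str) -> dict:
    """Map destination FPC index -> {counter: (high, low)} for nonzero drop rows."""
    drops = {}
    dest = None
    in_drop_section = False
    for line in output.splitlines():
        header = HEADER_RE.search(line)
        if header:
            dest = int(header.group(1))
            in_drop_section = False
            continue
        if "statistics:" in line:
            # Section titles: "Total statistics:", "Tx statistics:", "Drop statistics:".
            in_drop_section = line.strip().startswith("Drop statistics:")
            continue
        row = ROW_RE.match(line)
        if row and in_drop_section and dest is not None:
            high, low = int(row.group(2)), int(row.group(3))
            if high or low:
                drops.setdefault(dest, {})[row.group(1)] = (high, low)
    return drops


if __name__ == "__main__":
    sample = """\
Destination FPC Index: 0, Source FPC Index: 3
 Total statistics:   High priority           Low priority
    Packets:              42445974              439161456
 Drop statistics:    High priority           Low priority
    Packets:                     0                      0
    Pps    :                     0                      0
"""
    print(nonzero_fabric_drops(sample) or "no fabric drops reported")
```

An empty result means that every drop row in the captured text was zero, which is the expected state after successful healing.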


Check drops in Grant/Request sent and received:

user@host> show chassis fabric statistics 0 fpc totals  <-- Ideally, all drop counters should be 0
SF-chip statistics for DPC0PFE0
Received     :
--------------
Drops        : 0
Sent         :
--------------
Drops        : 0
SF-chip statistics for DPC0PFE1
Received     :
--------------
Drops        : 0
Sent         :
--------------
Drops        : 0
SF-chip statistics for DPC0PFE2
Received     :
--------------
Drops        : 0
Sent         :
--------------
Drops        : 0
SF-chip statistics for DPC0PFE3
Received     :                         
--------------
Drops        : 0
Sent         :
--------------
Drops        : 0
<…>
 

Verify Fabric In/Out drops:

user@host> show pfe statistics traffic
Packet Forwarding Engine traffic statistics:
    Input  packets:               102682                    5 pps
    Output packets:                58033                    4 pps
    Fabric Input  packets:             0                    0 pps
    Fabric Output packets:             0                    0 pps
Packet Forwarding Engine local traffic statistics:                  
   <snip>
Packet Forwarding Engine local protocol statistics:
    <snip>
Packet Forwarding Engine hardware discard statistics:
    Timeout                    :                    0
    Truncated key              :                    0   
    Bits to test               :                    0        
    Data error                 :                    0
    TCP header length error    :                    0
    Stack underflow            :                    0
    Stack overflow             :                    0
    Normal discard             :                    0
    Extended discard           :                    0
    Invalid interface          :                    0
    Info cell drops            :                    39
    Fabric drops               :                    0
Packet Forwarding Engine Input IPv4 Header Checksum Error and Output MTU Error statistics:
    Input Checksum             :                    0
    Output MTU                 :                    0


PFE traffic can also be verified for a specific FPC and PFE:

user@host> show pfe statistics traffic detail fpc 4 pfe 0
Packet Forwarding Engine Details:
    fpc:                    4
    pfe:                    0
Packet Forwarding Engine traffic statistics:
    Input  packets:                   34                    2 pps
    Output packets:                   34                    2 pps
    Fabric Input  packets:             0                    0 pps
    Fabric Output packets:             0                    0 pps
Packet Forwarding Engine hardware discard statistics:
    Timeout                    :                    0
    Truncated key              :                    0
    Bits to test               :                    0
    Data error                 :                    0
    TCP header length error    :                    0
    Stack underflow            :                    0
    Stack overflow             :                    0
    Normal discard             :                   93
    Extended discard           :                    0
    Invalid interface          :                    0
    Info cell drops            :                    0
    Fabric drops               :                    0
Packet Forwarding Engine Input IPv4 Header Checksum Error and Output MTU Error statistics:
    Input Checksum             :                    0
    Output MTU                 :                    
Notes:
  1. If you see a nonzero drop counter, run the command several times to check whether the counter is still increasing. If the counter is stable, the loss was transient and there is no ongoing issue.
  2. If the FPC is continuously rebooting, generating core dumps, or remaining offline, contact your JTAC representative for further assistance. Fabric healing may have failed to recover the FPC, which could indicate a hardware failure.
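
The comparison described in note 1 can be sketched in Python. The example assumes two text captures of "show pfe statistics traffic" taken some time apart, and it tracks only the "Info cell drops" and "Fabric drops" rows; the function names and the verdict labels are hypothetical.

```python
import re

COUNTER_RE = re.compile(r"^\s*(Info cell drops|Fabric drops)\s*:\s*(\d+)\s*$")


def read_drop_counters(output: str) -> dict:
    """Pull the drop counters of interest out of pasted command output."""
    counters = {}
    for line in output.splitlines():
        match = COUNTER_RE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def classify(before: dict, after: dict) -> dict:
    """Label each counter present in both snapshots."""
    verdict = {}
    for name in sorted(before.keys() & after.keys()):
        old, new = before[name], after[name]
        if new > old:
            verdict[name] = "increasing"  # drops are still occurring
        elif new < old:
            verdict[name] = "reset"       # counter went backwards, e.g. after a restart
        elif new == 0:
            verdict[name] = "zero"
        else:
            verdict[name] = "stable"      # nonzero but unchanged: loss was transient
    return verdict


if __name__ == "__main__":
    first = """\
    Info cell drops            :                    39
    Fabric drops               :                    0
"""
    second = first  # Second capture taken later; unchanged in this example.
    print(classify(read_drop_counters(first), read_drop_counters(second)))
```

A counter reported as "increasing" across several captures is the case that warrants further investigation.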
