SIB link errors are seen on QFX10000

Article ID: KB37087  KB Last Updated: 22 Jun 2021  Version: 1.0
Summary:

This article describes the SIB link errors observed on QFX10k platforms, how to determine whether they are hardware or software related, and how to address them accordingly.

Symptoms:

The QFX10k architecture has no midplane; each FPC connects directly to all SIBs. When a single CRC (Cyclic Redundancy Check) error occurs between a SIB and an FPC, that internal fabric link is brought down and chassis alarms are raised. Occasional CRC errors are normal during digital data transmission and do not necessarily indicate a software or hardware malfunction. There is usually no impact on the system when the reported error rate is relatively low.

When the device encounters a CRC link error, alarms similar to the following may be raised:  

user@qfx10008-re0> show chassis alarms
2 alarms currently active
Alarm time               Class  Description
2021-04-13 00:30:55 UTC  Minor  FPC 3 SIB Link Error
2021-04-13 00:30:02 UTC  Minor  FPC 2 SIB Link Error
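
When the alarm output is collected from several devices, the affected FPC slots can be pulled out with a small script. The following is a minimal Python sketch, assuming the alarm lines look exactly like the sample above; the function name `fpcs_with_sib_link_error` is hypothetical and not part of Junos.

```python
import re

# Matches lines such as:
# 2021-04-13 00:30:55 UTC  Minor  FPC 3 SIB Link Error
ALARM_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \S+)\s+"
    r"(?P<cls>Minor|Major)\s+FPC (?P<fpc>\d+) SIB Link Error\s*$"
)


def fpcs_with_sib_link_error(alarm_output):
    """Return the sorted FPC slot numbers that have a 'SIB Link Error' alarm."""
    slots = set()
    for line in alarm_output.splitlines():
        match = ALARM_RE.match(line.strip())
        if match:
            slots.add(int(match.group("fpc")))
    return sorted(slots)


sample = """\
2 alarms currently active
Alarm time               Class  Description
2021-04-13 00:30:55 UTC  Minor  FPC 3 SIB Link Error
2021-04-13 00:30:02 UTC  Minor  FPC 2 SIB Link Error
"""

print(fpcs_with_sib_link_error(sample))  # [2, 3]
```
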
Solution:
  1. Check the logs to confirm whether the chassis alarm is caused by CRC/BER link errors between the SIB and the FPC.

    Example output in Syslog:

    Mar 20 10:26:48.816  jtac-qfx10008-r2002 : %PFE-3: spmb0 CMQFXSIB: Fabric errors detected for SIB 0 PF 1 Fatal: False PIO Error: False Link Error: True
    Mar 20 10:26:48.818  jtac-qfx10008-r2002 : %PFE-3: spmb0 CMQFXSIB: Link Error on sib 0 plane-index 1 fpc 2 pfe 0
    Mar 20 10:26:49.162  jtac-qfx10008-r2002 : chassisd[7246]: %DAEMON-3-CHASSISD_FM_ERROR: fm_qfx10_ev_sib_state_handler: SIB got link errors (SIB#0, Packet Forwarding Engine 0 on FPC 2)

    Or

    Mar 19 04:10:09.568  jtac-qfx10008-r2002 : %PFE-3: fpc4 CCL: 3 CRC errors seen on link PE1-Avg-28nm-link-1-9
    Mar 19 04:10:09.569  jtac-qfx10008-r2002 : %PFE-5: fpc4 CCL: 1 BER errors seen on PE1-Avg-28nm-link-1-9
    Mar 19 04:10:09.000  jtac-qfx10008-r2002 : %USER-3: fpc4 fpc4 dcpfe: cmqfx: Link errors detected for PFE 1 SIB 4 Plane 8
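
The log lines above can be filtered and decoded mechanically. The sketch below is illustrative only: the regular expressions are derived from the sample messages in this article, and the names `PATTERNS` and `classify` are hypothetical. Other Junos releases may word these messages differently.

```python
import re

# Patterns derived from the sample syslog lines in this article.
PATTERNS = [
    ("sib_link_error", re.compile(
        r"Link Error on sib (?P<sib>\d+) plane-index (?P<plane>\d+) "
        r"fpc (?P<fpc>\d+) pfe (?P<pfe>\d+)")),
    ("ccl_errors", re.compile(
        r"(?P<reporter>\S+) CCL: (?P<count>\d+) (?P<kind>CRC|BER) "
        r"errors seen on (?:link )?(?P<link>\S+)")),
    ("pfe_link_errors", re.compile(
        r"Link errors detected for PFE (?P<pfe>\d+) "
        r"SIB (?P<sib>\d+) Plane (?P<plane>\d+)")),
]


def classify(line):
    """Return (message_type, fields) for a known message, else None."""
    for name, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return name, match.groupdict()
    return None


lines = [
    "Mar 20 10:26:48.818  jtac-qfx10008-r2002 : %PFE-3: spmb0 CMQFXSIB: "
    "Link Error on sib 0 plane-index 1 fpc 2 pfe 0",
    "Mar 19 04:10:09.568  jtac-qfx10008-r2002 : %PFE-3: fpc4 CCL: "
    "3 CRC errors seen on link PE1-Avg-28nm-link-1-9",
    "Mar 19 04:10:09.000  jtac-qfx10008-r2002 : %USER-3: fpc4 fpc4 dcpfe: "
    "cmqfx: Link errors detected for PFE 1 SIB 4 Plane 8",
]

for line in lines:
    print(classify(line))
```
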

     
  2. Check the chassis fabric topology output to identify which links have a problem:

    Example:

    user@qfx10008-re0> show chassis fabric topology
     In-link  : FPC# FE# ASIC# (TX inst#, TX sub-chnl #) ->
                SIB# ASIC#_FCORE# (RX port#, RX sub-chn#, RX inst#)
     Out-link : SIB# ASIC#_FCORE# (TX port#, TX sub-chn#, TX inst#) ->
                FPC# FE# ASIC# (RX inst#, RX sub-chnl #)

    -----snip----------
    FPC05FE2(1,10)->S00F1_0(02,4,02) Error   S00F1_0(02,5,02)->FPC05FE2(1,12) Down
    FPC05FE2(1,12)->S00F1_0(02,6,02) Error   S00F1_0(02,7,02)->FPC05FE2(1,10) Down
    FPC05FE2(1,14)->S00F1_0(04,3,04) Error   S00F1_0(04,2,04)->FPC05FE2(1,14) Down
    FPC02FE4(0,03)->S04F1_0(10,0,10) OK      S04F1_0(11,0,11)->FPC02FE4(0,03) Error
    FPC02FE4(0,02)->S04F1_0(10,2,10) OK      S04F1_0(11,2,11)->FPC02FE4(0,02) Error
    FPC02FE4(1,03)->S04F1_0(12,2,12) OK      S04F1_0(13,3,13)->FPC02FE4(1,03) Error
    -----snip----------
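
On a large chassis the topology output is long, so it can help to list only the links that are not OK. The following Python sketch assumes each data row has the two-column layout shown above (in-link on the left, out-link on the right) and uses the field order from the legend. The names `IN_LINK`, `OUT_LINK`, and `faulty_links` are hypothetical.

```python
import re

# In-link:  FPC/FE (TX inst, TX sub-chnl) -> SIB/ASIC_FCORE (RX port, RX sub-chn, RX inst)
IN_LINK = re.compile(
    r"FPC(?P<fpc>\d+)FE(?P<fe>\d+)\((?P<tx_inst>\d+),(?P<tx_sc>\d+)\)->"
    r"S(?P<sib>\d+)F(?P<pf>\d+)_(?P<fcore>\d+)"
    r"\((?P<rx_port>\d+),(?P<rx_sc>\d+),(?P<rx_inst>\d+)\)\s+(?P<state>\S+)"
)
# Out-link: SIB/ASIC_FCORE (TX port, TX sub-chn, TX inst) -> FPC/FE (RX inst, RX sub-chnl)
OUT_LINK = re.compile(
    r"S(?P<sib>\d+)F(?P<pf>\d+)_(?P<fcore>\d+)"
    r"\((?P<tx_port>\d+),(?P<tx_sc>\d+),(?P<tx_inst>\d+)\)->"
    r"FPC(?P<fpc>\d+)FE(?P<fe>\d+)\((?P<rx_inst>\d+),(?P<rx_sc>\d+)\)\s+(?P<state>\S+)"
)


def faulty_links(topology_output):
    """Return one dict per in-link or out-link whose state is not 'OK'."""
    faults = []
    for line in topology_output.splitlines():
        for direction, pattern in (("in", IN_LINK), ("out", OUT_LINK)):
            match = pattern.search(line)
            if match and match.group("state") != "OK":
                fields = {
                    key: (value if key == "state" else int(value))
                    for key, value in match.groupdict().items()
                }
                fields["direction"] = direction
                faults.append(fields)
    return faults


sample = """\
FPC05FE2(1,10)->S00F1_0(02,4,02) Error   S00F1_0(02,5,02)->FPC05FE2(1,12) Down
FPC02FE4(0,03)->S04F1_0(10,0,10) OK      S04F1_0(11,0,11)->FPC02FE4(0,03) Error
"""

for fault in faulty_links(sample):
    print(fault["direction"], fault["fpc"], fault["sib"], fault["state"])
```
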
  3. Identify whether the link error was seen on in-links (Rx at the fabric chip) or on out-links (Rx at the FPC).

    In the commands below, replace the PE chip, RX port, and sub-channel with the values from the 'show chassis fabric topology' output.
    Collect these CCL outputs three times, a few hours apart.

    For example, SIB4 with the following error:

    FPC05FE2(1,10)->S04F1_0(02,4,02) Error   S04F1_0(02,5,02)->FPC05FE2(1,12) Down
    S04F1_0(02,5,02), indicates  SIB#4 , PF chip#1, RX port #2, Sub-channel#5

    Use these shell commands:

    cprod -A spmb0 -c "show ccl channel sib4_pf_1 rx2 sc5"
    cprod -A spmb0 -c "show ccl statistics detail sib4_pf_1 rx2 sc5"
    cprod -A spmb0 -c "show syslog messages"
    cprod -A spmb0 -c "show ccl errors"
     

    For FPC2 with the following error:

    FPC02FE4(0,03)->S04F1_0(10,0,10) OK      S04F1_0(11,0,11)->FPC02FE4(0,03) Error
    FPC02FE4(0,03), indicates  FPC#2, PE chip#4, RX port#0, Sub-channel#3

    Use these shell commands:

    cprod -A fpc2 -c "show ccl channel pe4 rx0 sc03"
    cprod -A fpc2 -c "show ccl statistics detail pe4 rx0 sc03"
    cprod -A fpc2 -c "show syslog messages"
    cprod -A fpc2 -c "show ccl errors"
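
Because the same four commands must be collected several times, generating them from the decoded topology fields avoids typing mistakes. The sketch below only reproduces the command patterns shown in this step; the helper names are hypothetical, and `spmb0` as the SIB-side target is taken from the example above rather than from any general rule.

```python
def _commands(target, endpoint):
    """Build the four collection commands for one CCL endpoint."""
    return [
        f'cprod -A {target} -c "show ccl channel {endpoint}"',
        f'cprod -A {target} -c "show ccl statistics detail {endpoint}"',
        f'cprod -A {target} -c "show syslog messages"',
        f'cprod -A {target} -c "show ccl errors"',
    ]


def sib_ccl_commands(sib, pf_chip, rx_port, sub_channel):
    """Commands for a SIB-side endpoint, run against spmb0 as in the article."""
    return _commands("spmb0", f"sib{sib}_pf_{pf_chip} rx{rx_port} sc{sub_channel}")


def fpc_ccl_commands(fpc, pe_chip, rx_port, sub_channel):
    """Commands for an FPC-side endpoint, run against that FPC."""
    return _commands(f"fpc{fpc}", f"pe{pe_chip} rx{rx_port} sc{sub_channel}")


# SIB 4, PF chip 1, RX port 2, sub-channel 5
for command in sib_ccl_commands(4, 1, 2, 5):
    print(command)

# FPC 2, PE chip 4, RX port 0, sub-channel written as "03" as in the topology output
for command in fpc_ccl_commands(2, 4, 0, "03"):
    print(command)
```

The sub-channel is passed through unchanged, since the examples in this article show both padded (`sc03`) and unpadded (`sc2`) forms.
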

  4. Based on the above output, identify whether the CRC errors are correctable. Correctable errors are usually software related and can be safely ignored if they do not appear repeatedly or in large numbers. Errors that are NOT correctable on a line card or SIB are usually hardware related, most likely caused by broken pins, physical deformity, and so on; in that case a thorough SIB/line card hardware inspection is needed.

    For example, the following output shows, for the given FPC/PE/RX/sub-channel, how many correctable and uncorrectable CRC/BER errors occurred in the last 10 minutes, the last 60 minutes, and the last 24 hours. The counters are printed in hexadecimal. This helps determine whether a hardware part has an issue and needs to be replaced.

    % cprod -A fpc2 -c "show ccl statistics detail pe4 rx0 sc2"
    Name FrameCnt AggrCRCErrCnt
    LastCRC AggrBERCnt
    ==========================================================================
    PE4-Avg-28nm-link-2-10 0x00000000109a5262 0x0000000000000ab6
    0x0 0x00000000000003ab
    0x0 0x0
    phy link 02 FEC uncorrectable, correctable err: current minutes, uncorrectable, correctable
    00000000, 00000000

    last 10m, index: 3               <--- 10 instances at 1 minute per instance
    record, uncorrectable, correctable
    000, 00000000, 00000000
    001, 00000000, 00000000
    002, 00000000, 00000000
    003, 00000000, 00000000
    004, 00000008, 00000001    <--- uncorrectable CRC error appears 8 times and correctable error appears once 4 minutes ago
    005, 00000000, 00000000
    006, 00000000, 00000000
    007, 00000000, 00000000
    008, 00000000, 00000000
    009, 00000004, 00000001    <--- uncorrectable CRC error appears 4 times and correctable error appears once 10 minutes ago
    last 1h, index: 2                   <--- 6 instances at 10 minutes per instance
    record, uncorrectable, correctable
    000, 00000010, 00000002        
    001, 00000004, 00000000
    002, 0000000c, 00000003
    003, 0000000d, 00000000
    004, 00000000, 00000000
    005, 00000014, 00000002    <--- counters are hexadecimal: uncorrectable CRC error appears 0x14 = 20 times and correctable error appears twice in this 10-minute record of the last hour, and so on
    last day, index: 15                 <--- 24 instances at 1 hour per instance
    record, uncorrectable, correctable
    000, 00000056, 00000008
    001, 00000060, 00000004
    002, 00000087, 0000000d
    003, 00000070, 0000000c
    004, 00000076, 00000010
    005, 0000006b, 0000000a
    006, 0000004a, 00000009
    007, 0000004a, 00000006
    008, 00000051, 00000005
    009, 0000004e, 0000000a
    010, 0000006b, 00000005
    011, 00000064, 0000000e
    012, 0000004c, 00000004
    013, 0000006c, 00000009
    014, 00000040, 00000004
    015, 00000046, 00000006
    016, 00000033, 00000007
    017, 00000041, 00000007
    018, 00000053, 00000009
    019, 00000042, 00000007
    020, 00000041, 00000005
    021, 00000030, 00000009
    022, 0000005a, 0000000f
    023, 00000048, 00000003
    phy link 02 CRC aggregation error and rate:
    current minutes,
    total, rate
    00000000, 00000000
    last 10m, total: 00000009, rate: 00000000, index: 3
    record, sub-total, sub-rate
    000, 00000000, 00000000
    001, 00000000, 00000000
    002, 00000000, 00000000
    003, 00000000, 00000000
    004, 00000006, 00000000
    005, 00000000, 00000000
    006, 00000000, 00000000
    007, 00000000, 00000000
    008, 00000000, 00000000
    009, 00000003, 00000000
    last 1h, total: 00000031, rate: 00000000, index: 2
    record, sub-total, sub-rate
    000, 0000000c, 00000000
    001, 00000003, 00000000
    002, 00000009, 00000000
    003, 0000000a, 00000000
    004, 00000000, 00000000
    005, 0000000f, 00000000
    last day, total: 000005da, rate: 00000000, index: 15
    record, sub-total, sub-rate
    000, 00000040, 00000000
    001, 00000048, 00000000
    002, 00000066, 00000000
    003, 00000051, 00000000
    004, 00000056, 00000000
    005, 00000051, 00000000
    006, 00000035, 00000000
    007, 00000038, 00000000
    008, 0000003d, 00000000
    009, 0000003b, 00000000
    010, 0000004f, 00000000
    011, 0000004b, 00000000
    012, 00000036, 00000000
    013, 00000051, 00000000
    014, 00000030, 00000000
    015, 00000032, 00000000
    016, 00000024, 00000000
    017, 00000031, 00000000
    018, 0000003b, 00000000
    019, 00000032, 00000000
    020, 00000031, 00000000
    021, 00000025, 00000000
    022, 0000003e, 00000000
    023, 00000036, 00000000
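
The counters in this output are hexadecimal, so it is easy to misread them by eye. The following Python sketch converts the "uncorrectable, correctable" records to decimal and totals them per window. It assumes the layout shown above and stops at the "CRC aggregation" section, which reuses the same window headers. The function names are hypothetical.

```python
import re

HEADER_RE = re.compile(r"^(last 10m|last 1h|last day),.*index: (\d+)")
RECORD_RE = re.compile(r"^(\d{3}), ([0-9a-fA-F]{8}), ([0-9a-fA-F]{8})")


def parse_fec_windows(text):
    """Map window name -> list of (uncorrectable, correctable) as integers."""
    windows = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if "CRC aggregation" in line:
            break  # the next section reuses the same window headers
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            windows[current] = []
            continue
        record = RECORD_RE.match(line)
        if record and current is not None:
            windows[current].append(
                (int(record.group(2), 16), int(record.group(3), 16))
            )
    return windows


def totals(windows):
    """Sum uncorrectable and correctable counts for each window."""
    return {
        name: (sum(u for u, _ in records), sum(c for _, c in records))
        for name, records in windows.items()
    }


# The "last 1h" FEC window from the output above.
sample = """\
last 1h, index: 2
record, uncorrectable, correctable
000, 00000010, 00000002
001, 00000004, 00000000
002, 0000000c, 00000003
003, 0000000d, 00000000
004, 00000000, 00000000
005, 00000014, 00000002
phy link 02 CRC aggregation error and rate:
last 1h, total: 00000031, rate: 00000000, index: 2
"""

print(totals(parse_fec_windows(sample)))  # {'last 1h': (65, 7)}
```
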
    1. If it is a software issue, check whether PR1435705 (SIB/FPC Link Error alarms might be observed on QFX10K due to a single CRC) applies.

      On QFX10k platforms, the CRC threshold was incorrectly set to zero in Junos OS releases before 18.2R3-S6. To fix this, the threshold was raised to a reasonable level (5 errors per minute) in 18.2R3-S6 and later releases.
      If the PR does not apply, reboot the respective FPC/SIB to clear the correctable errors.
    2. If it is a hardware-related issue, thoroughly inspect the affected line card/SIB for physical deformity, bent pins, or alignment issues that could affect connectivity.

      If none is found, replace the FPC/SIB one at a time and verify the error status after each step.
      If errors are still seen after that, swap the FPC/SIB slots one at a time to verify whether the issue follows the card (FPC/SIB) or the slot.
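
To illustrate the difference between a CRC threshold of zero and the 5-errors-per-minute threshold mentioned for PR1435705, the sketch below checks per-minute CRC sub-totals against a threshold. It is not how Junos evaluates the threshold internally: the strict "greater than" comparison and the function name are assumptions made for illustration. The input values are the "last 10m" CRC sub-totals from the sample output in step 4.

```python
CRC_THRESHOLD_PER_MINUTE = 5  # value quoted in this article for 18.2R3-S6 and later


def minutes_over_threshold(per_minute_hex, threshold=CRC_THRESHOLD_PER_MINUTE):
    """Return (record_index, count) for records whose count exceeds threshold.

    Assumption: 'exceeds' means strictly greater than the threshold.
    """
    counts = [int(value, 16) for value in per_minute_hex]
    return [(index, count) for index, count in enumerate(counts) if count > threshold]


# "last 10m" CRC sub-totals from the sample output (hexadecimal strings).
last_10m = [
    "00000000", "00000000", "00000000", "00000000", "00000006",
    "00000000", "00000000", "00000000", "00000000", "00000003",
]

print(minutes_over_threshold(last_10m))               # [(4, 6)]
print(minutes_over_threshold(last_10m, threshold=0))  # [(4, 6), (9, 3)]
```

With a threshold of zero, every minute with at least one CRC error is flagged, which matches the single-CRC behavior described for releases before the fix.
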

Please contact your JTAC Representative for assistance with the investigation.
