Syslog message: XMCHIP.*DDRIF.*Checksum error for FO/WO.*Channel.*Address.*Checksum Errors

Article ID: KB31687 KB Last Updated: 06 Oct 2021 Version: 6.0
Summary:

The "DDRIF Checksum error" message reports a transient or permanent hardware issue.


This is a Troubleshooting Article for a PFE ASIC Syslog Event.
To view other documented syslog events related to XMCHIP, XLCHIP, MQCHIP, LUCHIP, EACHIP, and PECHIP, see KB31893 - Index of Articles for Troubleshooting PFE ASIC Syslog Events.

Symptoms:

When a "DDRIF Checksum error" event occurs, messages similar to the following are reported:

<Host> <FPC#> XMCHIP(x): DDRIF: Checksum error for FO/WO2 - Channel 16, Address 0x9402c, Checksum Errors 1, Checksum Poison Count 0
<Host> <FPC#> XMCHIP(x): DDRIF: Checksum error for WO1 - Channel 16, Address  0x9402c, Checksum Errors 1, Checksum Poison Count 0
<Host> <FPC#> XMCHIP(x): DDRIF: Checksum error for FO1 - Channel 16, Address 0x9402c, Checksum Errors 1, Checksum Poison Count 0
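
When scanning collected logs, it can help to pull the individual fields out of these lines. The following is a minimal sketch, not a Juniper tool: the regular expression is modeled only on the three sample messages above, and the function name `parse_ddrif` and the field names are illustrative assumptions. Other Junos releases may format the line differently.

```python
import re

# Pattern modeled on the sample messages in this article (an assumption;
# the exact format may vary between releases).
DDRIF_RE = re.compile(
    r"XMCHIP\((?P<chip>[^)]+)\): DDRIF: Checksum error for (?P<block>\S+) - "
    r"Channel (?P<channel>\d+), Address\s+(?P<address>0x[0-9a-fA-F]+), "
    r"Checksum Errors (?P<errors>\d+), Checksum Poison Count (?P<poison>\d+)"
)


def parse_ddrif(line):
    """Return the fields of a DDRIF checksum error line, or None if no match."""
    m = DDRIF_RE.search(line)
    if m is None:
        return None
    return {
        "chip": m.group("chip"),
        "block": m.group("block"),                # e.g. "FO1", "WO1", "FO/WO2"
        "channel": int(m.group("channel")),
        "address": int(m.group("address"), 16),   # hex string to integer
        "errors": int(m.group("errors")),
        "poison": int(m.group("poison")),
    }


sample = (
    "<Host> <FPC#> XMCHIP(x): DDRIF: Checksum error for FO/WO2 - "
    "Channel 16, Address 0x9402c, Checksum Errors 1, Checksum Poison Count 0"
)
print(parse_ddrif(sample))
```

The `block` field is the part that matters most for triage, because the exposure differs between the FO and WO blocks.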

Indications:

  1. Service impacting, depending on the error counts reported

  2. High exposure: if the error is for the FO block and continues for a sustained period of time, it causes permanent packet forwarding issues on all remote PFEs in the system

  3. If the error is reported for the WO block only, exposure is limited to the PFE reporting the error
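
The FO/WO distinction above can be expressed as a small lookup. This is a sketch under one stated assumption: any block name containing "FO" (for example "FO1" or "FO/WO2" from the sample messages) is treated as an FO-block error. The function name `exposure` and its return labels are hypothetical.

```python
def exposure(block):
    """Map the block name from the syslog line to the exposure described above.

    Assumption: a name containing "FO" (such as "FO1" or "FO/WO2") counts as
    an FO-block error; the article does not define the combined form.
    """
    name = block.upper()
    if "FO" in name:
        return "system-wide"  # remote PFEs can be affected if errors are sustained
    if "WO" in name:
        return "local"        # limited to the PFE reporting the error
    return "unknown"


for block in ("FO1", "WO1", "FO/WO2"):
    print(block, exposure(block))
```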

 

Cause:

This is a transient or permanent hardware issue on the WAN side or the fabric side, or it is due to a software defect.
A transient condition is due to the OCM parity error reported in the syslog.

A CMALARM was added in enhancement PR1157937. If the checksum error rate is in the range of 5-255/sec, a minor alarm is raised. If the rate is more than 255/sec, a major alarm is raised. The error counter indicates the number of packets dropped. Once the error rate reaches 255/sec, turn off the MPC immediately (see the Solution section for the steps), because it could have an operational impact on all remote PFEs. Usually, such MPC boards fail memory diagnostics tests during the reboot.
PR1166106 prevents an MPC that fails memory BIST tests from becoming operational and causing further outages.
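
The alarm thresholds described above can be summarized in a few lines. This is an illustrative sketch only, not the logic Junos runs: the function name `alarm_severity` is hypothetical, and the behavior below 5 errors/sec (no alarm) is an assumption, since the article does not state it.

```python
def alarm_severity(errors_per_sec):
    """Return the alarm level for a checksum error rate, per the thresholds above.

    Assumption: rates below 5/sec raise no alarm; the article does not say.
    """
    if errors_per_sec > 255:
        return "major"
    if errors_per_sec >= 5:
        return "minor"
    return None


for rate in (1, 5, 255, 256):
    print(rate, alarm_severity(rate))
```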

 

Solution:



Perform these steps to determine the cause and resolve the problem (if any). Continue through each step until the problem is resolved.

  1. Collect the show command output.

    Capture the output to a file (in case you have to open a technical support case). To do this, configure each SSH client/terminal emulator to log your session.

    show log messages
    show log chassisd
    start shell pfe network <fpc#>
    show nvram
    show syslog messages
    exit


  2. Analyze the show command output.

    In the 'show log messages' output, review the events that occurred at or just before the appearance of the "DDRIF Checksum error" message. Frequently these events help identify the cause.

    • Contact Juniper Support immediately to RMA the card if the FO block is affected, regardless of the error rate

    • The generic pfe-disable event script will detect this condition if the error rate is too high and invoke the pfe-disable action

    • Run the CLI command ‘request chassis fpc slot # offline’

      • Then configure the FPC to be powered off until it is replaced.

        • This is needed to prevent the risk of operational impact on the remote PFEs
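
As an illustration of the last two bullets, a small helper can produce the command sequence for a given slot. This is a sketch, not a Juniper-provided script: the helper name `offline_commands` is hypothetical, and the `set chassis fpc <slot> power off` statement is not quoted in this article. It is assumed here as the usual Junos configuration for keeping an FPC powered off, so confirm it against the documentation for your release before use.

```python
def offline_commands(slot):
    """Return the operational and configuration commands for one FPC slot."""
    if not isinstance(slot, int) or slot < 0:
        raise ValueError("slot must be a non-negative integer")
    return [
        f"request chassis fpc slot {slot} offline",  # command given in the article
        "configure",
        f"set chassis fpc {slot} power off",         # assumed configuration statement
        "commit",
    ]


for command in offline_commands(3):
    print(command)
```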

 

This article is indexed in KB31893 - primary Index of Articles for Troubleshooting PFE ASIC Syslog Events; tag XMCHIPTSG


Tip: When looking at an event in the logs, it is important to focus on the first error message in a collection of syslog messages. The first error message is usually the cause of all the follow-on error messages. The follow-on collateral damage error messages can be ignored.

 

Modification History:
2020-08-02: Added OCM block PR1530244 to raise a major alarm if the parity error events continue for longer than 5 events
2019-09-30: Article reviewed for accuracy; no changes required.
2017-09-11: Added some clarity around how to offline the fpc and then configure the fpc to be powered off.