
Syslog message: Double-bit ECC error


Article ID: KB32154
KB Last Updated: 11 Oct 2021
Version: 3.0
Summary:

The "Double-bit ECC error" message is caused by a transient hardware error.

This is a Troubleshooting Article for a PFE ASIC Syslog Event.
To view other documented syslog events related to XMCHIP, XLCHIP, MQCHIP, LUCHIP, EACHIP, and PECHIP, see KB31893 - Index of Articles for Troubleshooting PFE ASIC Syslog Events.


Symptoms:

When a "Double-bit ECC error" event occurs, messages similar to the following are reported:

EACHIP:

[Sep 7 13:06:35.434 LOG: Err] EA[0:0].idmem_slice[0].protect CAE[0], ext_addr 0x00000005 -> ECC Error @ IDMEM[0xfbf80005]
[Sep 7 13:06:35.434 LOG: Err] EA[0:0]_PPE 72 Errors sync xtxn error
[Sep 7 13:06:35.434 LOG: Err] EA[0:0]_PPE 0 Errors sync xtxn error
[Sep 7 13:06:35.434 LOG: Err] EA[0:0]_PPE 1 Errors sync xtxn error
[Sep 7 13:06:35.434 LOG: Debug] Cmerror: Draining ASIC error message queue
[Sep 7 13:06:35.434 LOG: Debug] cmerror_process_queue: module = LKUP-EA[0:0]
[Sep 7 13:06:35.434 LOG: Debug] Cmerror: processing the task op_type 1 for level 1 level_count 0 occur_count 0 clear_count 0 level_threshold 1 level_action 0x44 item errid 262176 item_threshold 1 item_count 0 item_sub_err_state 0 sub_item errid 0 sub_item_state 0 item_times
[Sep 7 13:06:35.434 LOG: Debug] Cmerror: Level 1 count increment 1 occur_count 1 clear_count 0
[Sep 7 13:06:35.434 LOG: Info] Error (0x40020), module: LKUP-EA[0:0], type: Double-bit ECC error
[Sep 7 13:06:35.434 LOG: Debug] Cmerror: Level 1 count 1 (occur_count 1 clear_count 0)crossed threshold 1 action 0x44
[Sep 7 13:06:35.434 LOG: Debug] cmerror_take_action_helper: performing action 4 for level 1 err_id 262176 module id 24
[Sep 7 13:06:35.434 LOG: Debug] cmerror_take_action_helper: performing action 2 for level 1 err_id 262176 module id 24
[Sep 7 13:06:35.599 LOG: Err] EA[0:0].llm: Host XTXN 14 has idle error status (FCODE_Addr 0x0000002cfbf80005).
[Sep 7 13:06:36.445 LOG: Err] EA[0:0].idmem_slice[0].protect Read Error @ IDMEM[0xfbf80005], now showing valid data (0x0000015e5c7306c8)
[Sep 7 13:06:36.445 LOG: Err] EA[0:0]_PPE 72 Errors sync xtxn error

XLCHIP:

[Sep 7 13:09:24.042 LOG: Err] XL[0:0].idmem_slice[0].protect CAE[0], ext_addr 0x00000005 -> ECC Error @ IDMEM[0x7bfa0005]
[Sep 7 13:09:24.046 LOG: Debug] Cmerror: Draining ASIC error message queue
[Sep 7 13:09:24.046 LOG: Debug] cmerror_process_queue: module = XL[0:0]
[Sep 7 13:09:24.046 LOG: Debug] Cmerror: processing the task op_type 1 for level 1 level_count 0 occur_count 0 clear_count 0 level_threshold 1 level_action 0x44 item errid 262176 item_threshold 1 item_count 0 item_sub_err_state 0 sub_item errid 0 sub_item_state 0 item_times
[Sep 7 13:09:24.046 LOG: Debug] Cmerror: Level 1 count increment 1 occur_count 1 clear_count 0
[Sep 7 13:09:24.046 LOG: Info] Error (0x40020), module: XL[0:0], type: Double-bit ECC error
[Sep 7 13:09:24.046 LOG: Debug] Cmerror: Level 1 count 1 (occur_count 1 clear_count 0)crossed threshold 1 action 0x44
[Sep 7 13:09:24.046 LOG: Debug] cmerror_take_action_helper: performing action 4 for level 1 err_id 262176 module id 9
[Sep 7 13:09:24.046 LOG: Debug] cmerror_take_action_helper: performing action 2 for level 1 err_id 262176 module id 9
[Sep 7 13:09:24.292 LOG: Err] XL[0:0].llm: Host XTXN 13 has idle error status (FCODE_Addr 0x0000002c7bfa0005).
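When you scan a long messages file for this event, it can help to pull out only the Info-level summary lines. The following Python sketch assumes the line format shown in the samples above; other releases may format these lines differently. It also shows that the hex error code 0x40020 is the same value as the decimal errid 262176 that appears in the Cmerror debug lines. The names SUMMARY_RE, EccEvent, and find_double_bit_events are hypothetical and are not part of any Juniper tool.

```python
import re
from dataclasses import dataclass

# Matches the Info-level summary line in the format shown above, e.g.
# [Sep 7 13:06:35.434 LOG: Info] Error (0x40020), module: LKUP-EA[0:0], type: Double-bit ECC error
SUMMARY_RE = re.compile(
    r"^\[(?P<stamp>\w{3}\s+\d{1,2} [\d:.]+) LOG: Info\] "
    r"Error \((?P<code>0x[0-9a-fA-F]+)\), module: (?P<module>\S+), "
    r"type: (?P<type>.+)$"
)


@dataclass
class EccEvent:
    stamp: str   # timestamp as printed in the log (no year)
    code: int    # error code, e.g. 0x40020
    module: str  # e.g. "LKUP-EA[0:0]" or "XL[0:0]"


def find_double_bit_events(lines):
    """Return one EccEvent per 'Double-bit ECC error' summary line."""
    events = []
    for line in lines:
        m = SUMMARY_RE.match(line.strip())
        if m and m.group("type") == "Double-bit ECC error":
            # int(..., 16) accepts the "0x" prefix
            events.append(EccEvent(m.group("stamp"), int(m.group("code"), 16), m.group("module")))
    return events


# The two summary lines from the EACHIP and XLCHIP samples above.
SAMPLE = [
    "[Sep 7 13:06:35.434 LOG: Info] Error (0x40020), module: LKUP-EA[0:0], type: Double-bit ECC error",
    "[Sep 7 13:09:24.046 LOG: Info] Error (0x40020), module: XL[0:0], type: Double-bit ECC error",
]

for ev in find_double_bit_events(SAMPLE):
    print(ev.stamp, ev.module, hex(ev.code), ev.code)
```

The regular expression is matched against whole lines, so Debug and Err lines in the same burst are skipped.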

Indications:

  • A fatal alarm condition automatically restarts the FPC to recover.

 

Cause:

On-chip IDMEM is ECC-protected. If a double-bit ECC error is detected, a fatal error is raised at the emergency level. The default action for a fatal CMERROR is to restart the FPC. If the same condition occurs again after the restart, hardware replacement is recommended. False-positive double-bit ECC error conditions were corrected in PR1032958.
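The following sketch illustrates why a double-bit error is treated differently from a single-bit error. It implements a textbook single-error-correct, double-error-detect (SECDED) code, an extended Hamming(8,4) code over 4 data bits. This is a generic illustration only. The article does not document the ECC scheme or word width that IDMEM actually uses, and the functions below are hypothetical.

```python
def encode(data4):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.
    Index 0 is the overall parity bit; indices 1..7 form classic Hamming(7,4)."""
    d1, d2, d3, d4 = data4
    p1 = d1 ^ d2 ^ d4  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 5, 6, 7
    word = [0, p1, p2, d1, p4, d2, d3, d4]
    word[0] = sum(word[1:]) % 2  # makes the total parity of all 8 bits even
    return word


def decode(word):
    """Return (status, data4), where status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(word) % 2  # 1 means an odd number of bits flipped
    w = list(word)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Assume exactly one flip; the syndrome names its position (0 = parity bit).
        w[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but a non-zero syndrome: two bits flipped.
        # The error is detected, but the flipped bits cannot be located.
        return "uncorrectable", None
    return status, [w[3], w[5], w[6], w[7]]


cw = encode([1, 0, 1, 1])
one = list(cw); one[5] ^= 1               # single-bit error
two = list(cw); two[2] ^= 1; two[6] ^= 1  # double-bit error
print(decode(cw))   # ('ok', [1, 0, 1, 1])
print(decode(one))  # ('corrected', [1, 0, 1, 1])
print(decode(two))  # ('uncorrectable', None)
```

A single flipped bit can be located and repaired in place. Two flipped bits leave the overall parity even but the syndrome non-zero, so the decoder can only report that the word is corrupt. This is consistent with the article's description of a double-bit error as fatal and recovered by restarting the FPC.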

Solution:



Perform the following steps to determine the cause and, if necessary, resolve the problem. Continue through the steps until the problem is resolved.

  1. Collect the show command output.

    Capture the output to a file in case you need to open a technical support case. To do this, configure your SSH client or terminal emulator to log the session.

    show log messages
    show log chassisd
    start shell pfe network fpc<slot>
    show nvram
    show syslog messages
    exit

  2. Analyze the show command output.

In the show log messages output, review the events that occurred at or just before the appearance of the "Double-bit ECC error" message. These events frequently help identify the cause.

  • If the event is reported again after the FPC restarts, contact your technical support representative to RMA the FPC.
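The RMA decision depends on whether the error is logged again after the FPC restarts. The following Python sketch shows that check. It assumes you have already collected the timestamps of the "Double-bit ECC error" events for one FPC and have read the FPC restart time from the logs yourself. The function names, the example times, and the supplied year are hypothetical.

```python
from datetime import datetime


def parse_stamp(stamp, year):
    """Parse a PFE log stamp such as 'Sep 7 13:06:35.434'.
    The stamps carry no year, so the caller supplies one (assumption)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")


def recurred_after_restart(event_stamps, restart_stamp, year=2021):
    """True if any Double-bit ECC event on this FPC was logged after the restart."""
    restart = parse_stamp(restart_stamp, year)
    return any(parse_stamp(s, year) > restart for s in event_stamps)


# Hypothetical times, for illustration only.
events = ["Sep 7 13:06:35.434"]
restart = "Sep 7 13:08:00.000"
print(recurred_after_restart(events, restart))                           # False
print(recurred_after_restart(events + ["Sep 7 14:10:02.120"], restart))  # True
```

Because the log stamps omit the year, this comparison gives the wrong answer across a December-to-January rollover unless each stamp is given its correct year.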

This article is indexed in KB31893 - Primary Index of Articles for Troubleshooting PFE ASIC Syslog Events; tag EACHIPTSG XLCHIPTSG


Tip: When looking at an event in the logs, it is important to focus on the first error message in a collection of syslog messages. The first error message is usually the cause of all the follow-on error messages. The follow-on collateral damage error messages can be ignored.
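Following the tip above, the first Err-level message in a burst is usually the one to investigate. This Python sketch finds the earliest Err line logged within a short window before the first "Double-bit ECC error" line, using the log format shown in the Symptoms section. The 2-second window, the default year, and the function name are assumptions for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches PFE log lines in the format shown in this article, e.g.
# [Sep 7 13:06:35.434 LOG: Err] EA[0:0].idmem_slice[0].protect ...
LINE_RE = re.compile(
    r"^\[(?P<stamp>\w{3}\s+\d{1,2} [\d:.]+) LOG: (?P<level>\w+)\] (?P<text>.*)$"
)


def first_error_before(lines, marker="Double-bit ECC error", window_s=2.0, year=2021):
    """Return the text of the earliest Err-level line logged within window_s
    seconds before (or at) the first line containing marker, else None.
    window_s and year are illustrative assumptions."""
    parsed = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S.%f")
            parsed.append((ts, m["level"], m["text"]))

    for i, (ts, _level, text) in enumerate(parsed):
        if marker in text:
            start = ts - timedelta(seconds=window_s)
            # Walk forward from the oldest line so the first hit is the earliest.
            for ts2, level2, text2 in parsed[: i + 1]:
                if level2 == "Err" and ts2 >= start:
                    return text2
            return None  # marker found, but no Err line in the window
    return None  # marker not found


# A trimmed copy of the EACHIP sample from the Symptoms section.
EACHIP_SAMPLE = [
    "[Sep 7 13:06:35.434 LOG: Err] EA[0:0].idmem_slice[0].protect CAE[0], ext_addr 0x00000005 -> ECC Error @ IDMEM[0xfbf80005]",
    "[Sep 7 13:06:35.434 LOG: Err] EA[0:0]_PPE 72 Errors sync xtxn error",
    "[Sep 7 13:06:35.434 LOG: Debug] Cmerror: Draining ASIC error message queue",
    "[Sep 7 13:06:35.434 LOG: Info] Error (0x40020), module: LKUP-EA[0:0], type: Double-bit ECC error",
]

print(first_error_before(EACHIP_SAMPLE))
```

For the EACHIP sample, the sketch returns the idmem_slice ECC error line rather than the follow-on PPE "sync xtxn error" lines that were logged in the same millisecond.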

 
