[EX] Understanding the soft error recovery feature on PFE

  [KB34383]


Summary:

This article describes the soft error recovery feature on EX4300 devices.

Symptoms:

An EX4300 may silently drop packets without raising any alarm or log.

Further troubleshooting reveals incorrect PFE programming. This kind of PFE programming issue can affect any register or memory entry, including entries that the control plane programmed with the correct value and that were later corrupted by a memory/register bit flip. Such a corruption is a parity error. These events are rare on any single device, but become more visible in large-scale deployments.
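The scale effect can be illustrated with a short calculation. This is a minimal sketch that assumes devices fail independently; the per-device probability used below is a purely hypothetical placeholder, not a measured EX4300 figure, and the function name is illustrative.

```python
def prob_at_least_one(p_per_device: float, n_devices: int) -> float:
    """Probability that at least one of n independent devices sees an event.

    p_per_device is the chance that a single device sees the event in some
    fixed period. Independence between devices is an assumption.
    """
    return 1.0 - (1.0 - p_per_device) ** n_devices


if __name__ == "__main__":
    p = 0.001  # hypothetical per-device probability, for illustration only
    for n in (1, 100, 1000, 10000):
        print(f"{n:>6} devices: {prob_at_least_one(p, n):.3f}")
```

Under these assumptions, an event that is negligible for one switch becomes likely somewhere in a fleet of a thousand or more.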

Cause:

The following reasons may cause a parity error:

  1. Emission of alpha particles from tiny amounts of radioactive materials present in the chips.
  2. Cosmic rays creating energetic neutrons and protons.

Solution:

A parity error does not permanently damage the PFE internal memory or registers; it is correctable.

Soft error recovery is a software feature that restores the PFE internal memory and registers. When it is enabled, the PFE detects parity errors, attempts to recover from them on its own, and logs the event. A parity error causes a brief packet drop, but once the error is recovered, the PFE forwards traffic correctly again.
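The general idea of parity-based detection and recovery can be sketched in a few lines of Python. This is a conceptual model only: the article does not describe how the PFE actually restores an entry, so the software shadow copy used here is an assumption, and all class and method names are hypothetical.

```python
def even_parity(word: int) -> int:
    """Return the parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


class ParityProtectedTable:
    """Toy model of a hardware table whose entries carry a parity bit."""

    def __init__(self, size: int):
        # Assumption: software keeps a copy of what the control plane programmed.
        self.shadow = [0] * size
        # Each hardware entry is (value, stored parity bit).
        self.hw = [(0, 0)] * size

    def program(self, index: int, value: int) -> None:
        """Write a value and its parity bit, and remember it in the shadow."""
        self.shadow[index] = value
        self.hw[index] = (value, even_parity(value))

    def flip_bit(self, index: int, bit: int) -> None:
        """Simulate a soft error: one data bit flips, the parity bit does not."""
        value, parity = self.hw[index]
        self.hw[index] = (value ^ (1 << bit), parity)

    def scan_and_restore(self) -> list:
        """Check every entry; rewrite mismatching ones from the shadow copy."""
        restored = []
        for index, (value, parity) in enumerate(self.hw):
            if even_parity(value) != parity:
                self.program(index, self.shadow[index])
                restored.append(index)
        return restored


if __name__ == "__main__":
    table = ParityProtectedTable(4)
    table.program(1, 0b1011)
    table.flip_bit(1, 2)              # corrupt entry 1
    print(table.scan_and_restore())   # indices that were repaired
```

A single parity bit detects any odd number of flipped bits in an entry but cannot tell which bit flipped, which is why this sketch rewrites the whole entry from a known-good copy.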

The feature has been enabled on the EX4300 PFE since Junos release 14.1X53-D51. There is no Junos CLI statement to configure it; it can only be enabled by upgrading the software.

Use the following steps to verify whether the feature is enabled:

start shell
root@ex4300:0% cprod -A fpc0 -c 'set ex getconfig' | grep parity
parity_correction = 1  <-- 1 means parity correction enabled
parity_enable = 1      <-- 1 means parity check enabled

root@s08-9:RE:0% vty fpc0
BSD platform (QorIQ P202H processor, 0MB memory, 0KB flash)

(vty)# set exbcm bcmshell "memscan"
MemSCAN: Running on unit 0    <-- "Running" means memscan is active; otherwise, parity errors in dynamic memories may not be detected when the memory is accessed
MemSCAN:   Interval: 100000 usec
MemSCAN:   Rate: 64
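When the check has to be repeated on many switches, the captured output of the first command can be evaluated by a script. The following is a minimal sketch that assumes the output has already been collected as text and has the `name = value` form shown above; the function names are hypothetical, and how the output is gathered from the device is out of scope.

```python
import re


def parse_parity_flags(output: str) -> dict:
    """Extract 'parity_* = N' settings from captured getconfig output."""
    flags = {}
    pattern = r"^\s*(parity_\w+)\s*=\s*(\d+)"
    for match in re.finditer(pattern, output, re.MULTILINE):
        flags[match.group(1)] = int(match.group(2))
    return flags


def soft_error_recovery_enabled(output: str) -> bool:
    """True only if both parity checking and parity correction are set to 1."""
    flags = parse_parity_flags(output)
    return flags.get("parity_enable") == 1 and flags.get("parity_correction") == 1


if __name__ == "__main__":
    captured = "parity_correction = 1\nparity_enable = 1\n"
    print(soft_error_recovery_enabled(captured))
```

A missing flag is treated as "not enabled", so output from a release that does not report these settings fails the check rather than passing silently.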

The following logs show parity error detection and recovery. If you see such logs, you can ignore them, because the PFE has already recovered from the error.

[Fri May 24 09:48:44 2019 LOG: Err] Unit: 0 
[Fri May 24 09:48:44 2019 LOG: Err] Mem:
[Fri May 24 09:48:44 2019 LOG: Err] Parity error..
[Fri May 24 09:48:44 2019 LOG: Err] Error in: SBUS transaction.
[Fri May 24 09:48:44 2019 LOG: Err] Blk: 2, Address: 0x04400001, base: 0x10, stage: 1, index: 1
[Fri May 24 09:48:44 2019 LOG: Err] Unit 0: mem: 478=EGR_DVP_ATTRIBUTE_1 blkoffset:4
[Fri May 24 09:48:44 2019 LOG: Err] Unit 0: CLEAR_RESTORE: EGR_DVP_ATTRIBUTE_1[478] blk: epipe0 index: 1 : [2][4400000]   // indicates the register error was cleared


[Fri May 24 09:37:53 2019 LOG: Err] Unit: 0 
[Fri May 24 09:37:53 2019 LOG: Err] Mem:
[Fri May 24 09:37:53 2019 LOG: Err] Parity error..
[Fri May 24 09:37:53 2019 LOG: Err] Error in: SOP cell.
[Fri May 24 09:37:53 2019 LOG: Err] Blk: 16, Address: 0x00001444, base: 0x0, stage: 0, index: 5188
[Fri May 24 09:37:53 2019 LOG: Err] Unit 0: mem: 3678=RAW_ENTRY_TABLE blkoffset:8
[Fri May 24 09:37:53 2019 LOG: Debug] STATUS: 0x00000083
[Fri May 24 09:37:53 2019 LOG: Debug] OPCODE: 0x1d000200
[Fri May 24 09:37:53 2019 LOG: Debug] START ADDR: 0x79c9cb60
[Fri May 24 09:37:53 2019 LOG: Debug] CUR ADDR: 0x1c001400
[Fri May 24 09:37:53 2019 LOG: Err] _soc_mem_array_sbusdma_read: L2_ENTRY_1.ism0 failed(ERR)
[Fri May 24 09:37:53 2019 LOG: Err] H/W received sbus nack with error bit set.
[Fri May 24 09:37:53 2019 LOG: Err] Unit: 0 
[Fri May 24 09:37:53 2019 LOG: Err] Multiple:
[Fri May 24 09:37:53 2019 LOG: Err] Mem:
[Fri May 24 09:37:53 2019 LOG: Err] Parity error..
[Fri May 24 09:37:53 2019 LOG: Err] Error in: SBUS transaction.
[Fri May 24 09:37:53 2019 LOG: Err] Blk: 16, Address: 0x1c001444, base: 0x0, stage: 7, index: 5188
[Fri May 24 09:37:53 2019 LOG: Err] Unit 0: mem: 2017=L2_ENTRY_1 blkoffset:8
[Fri May 24 09:37:53 2019 LOG: Err] Unit 0: CLEAR_RESTORE: L2_ENTRY_2[2018] blk: ism0 index: 2594 : [16][1c000000]    // indicates the memory error was cleared

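To confirm that each logged parity error was followed by a recovery, the CLEAR_RESTORE lines can be pulled out of a saved log. The following is a minimal sketch that assumes the log lines have exactly the layout of the samples above; the regular expression and function name are illustrative and may need adjusting for other releases.

```python
import re

# Matches lines such as:
# [Fri May 24 09:48:44 2019 LOG: Err] Unit 0: CLEAR_RESTORE: NAME[478] blk: epipe0 index: 1 : ...
RESTORE_RE = re.compile(
    r"\[(?P<timestamp>[^\]]*?) LOG: \w+\] "
    r"Unit (?P<unit>\d+): CLEAR_RESTORE: "
    r"(?P<mem>\w+)\[(?P<mem_id>\d+)\] "
    r"blk: (?P<blk>\w+) index: (?P<index>\d+)"
)


def find_restores(log_text: str) -> list:
    """Return one dict per CLEAR_RESTORE line found in the log text."""
    events = []
    for line in log_text.splitlines():
        match = RESTORE_RE.search(line)
        if match:
            event = match.groupdict()
            # Convert the numeric fields from strings to integers.
            for key in ("unit", "mem_id", "index"):
                event[key] = int(event[key])
            events.append(event)
    return events


if __name__ == "__main__":
    sample = (
        "[Fri May 24 09:48:44 2019 LOG: Err] Unit 0: CLEAR_RESTORE: "
        "EGR_DVP_ATTRIBUTE_1[478] blk: epipe0 index: 1 : [2][4400000]"
    )
    for event in find_restores(sample):
        print(event["timestamp"], event["mem"], event["index"])
```

Each returned entry names the memory or register that was restored, so the list can be compared against the preceding "Parity error.." entries in the same log.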
Related Links: