[PTX/QFX] Syslog message: PE CHIP :: FATAL ERROR "HMCIF: Link0: HMC Fatal Error cmd:62"

Article ID: KB33920    KB Last Updated: 12 Jun 2020    Version: 3.0
Summary:

This article describes the syslog message PE CHIP :: FATAL ERROR "HMCIF: Link0: HMC Fatal Error cmd:62", which is reported on PTX devices, and explains how to recover from the error.

 

Symptoms:

The error appears as shown below:

Syslog messages

<Date>  <router name>  <FPC#> Cmerror Op Set: PE Chip::FATAL ERROR!! from PE1[1]: HMCIF: Link0: HMC Fatal Error cmd:62 lng:1 ltag:0 dinv:0 errstat:127 err_cnt:0x40000000
<Date>  <router name>  <FPC#> Cmerror Op Set: PE Chip::FATAL ERROR!! from <PFE ID>: HMCIF: Link#: HMC Fatal Error cmd:62 lng:1 ltag:0 dinv:0 errstat:127 err_cnt:0x40000000
<Date>  <router name>  <FPC#> Cmerror Op Set: PE Chip::FATAL ERROR!! from <PFE ID>: HMCIF: Link#: HMC Fatal Error cmd:62 lng:1 ltag:0 dinv:0 errstat:127 err_cnt:0x40000000
<Date>  <router name>  <FPC#> JPRDS_FDB:ERR:jprds_fdb_set_pfe_disabled(),7317: <pfe id> marked disabled
<Date>  <router name>  <FPC#> hmc_eri_config_access, HMC #, read eri 10 timeout error
<Date>  <router name>  <FPC#>   Dumping Micron HMC # FATAL ERR DUMP 3181 entries ...
<Date>  <router name>  <FPC#> micron_hmc_get_runtime_info, hmc # eri 10 error
<Date>  <router name>  <FPC#> cmsngfpc_hmc_temp_check, HMC # <PFE ID>-HMC0-DIE run time info read error
<Date>  <router name>  <FPC#> Cmerror Op Set: Generic HMC::FATAL ERROR!! from HMC#-#-#: eri timeout error
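
When scanning collected logs for this condition, it can help to extract the fields of the HMCIF message programmatically. The following is a minimal Python sketch, assuming the message layout shown above; the function name, the timestamp, and the host name in the sample line are hypothetical, and the article does not document what each field means in hardware terms.

```python
import re

# Matches the "PE Chip::FATAL ERROR!! ... HMCIF" line as it appears in this
# article. Group names simply mirror the labels in the message text.
HMC_FATAL_RE = re.compile(
    r"PE Chip::FATAL ERROR!! from (?P<pfe>\S+): HMCIF: (?P<link>Link\S+): "
    r"HMC Fatal Error cmd:(?P<cmd>\d+) lng:(?P<lng>\d+) ltag:(?P<ltag>\d+) "
    r"dinv:(?P<dinv>\d+) errstat:(?P<errstat>\d+) "
    r"err_cnt:(?P<err_cnt>0x[0-9a-fA-F]+)"
)


def parse_hmc_fatal(line):
    """Return the message fields as a dict, or None if the line does not match."""
    match = HMC_FATAL_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("cmd", "lng", "ltag", "dinv", "errstat"):
        fields[key] = int(fields[key])
    fields["err_cnt"] = int(fields["err_cnt"], 16)  # hexadecimal counter
    return fields


# Hypothetical timestamp and host name; the message body is from the article.
sample = (
    "Jun 12 10:00:00  router1  fpc0 Cmerror Op Set: PE Chip::FATAL ERROR!! "
    "from PE1[1]: HMCIF: Link0: HMC Fatal Error cmd:62 lng:1 ltag:0 dinv:0 "
    "errstat:127 err_cnt:0x40000000"
)
info = parse_hmc_fatal(sample)
print(info["pfe"], info["link"], info["cmd"], info["errstat"])
```

Lines that do not contain the HMCIF message, such as the "eri timeout error" lines, return None and can be handled separately.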
 

Note that the PFE is disabled because of the fatal error, as shown in the following output:

<router name>  > show chassis fpc errors

FPC  Level  Occurred  Cleared  Threshold  Action-Taken  Action
#    Minor         0        0         10             0  LOG|
     Major         0        0          1             0  GET STATE|ALARM|
     Fatal         8        0          1            16  DISABLE PFE
     Pfe-State:  pfe-# -DISABLED
 

When you log in to the vty of the affected FPC, you see the following:

router-name > start shell pfe network fpc#
<FPC#> (PTX10008-re0 vty)# show pechip #
PE[1] : ASIC Name: PE1, ASIC ID: 1
Version 2.0 JTAG ID: 692393343
Initialized: Yes Fault/Disabled: Yes   >>>> PFE is disabled (Marked down)

<FPC#> <router name>(PTX10008-re0 vty)# show cmerror module brief
---------------------------------------
Module  Name              Active Errors
---------------------------------------
1       PCIe Error        0
2       CPU Error         0
3       Eth Port Error    0
4       Host Loopback     0
5       Generic HMC       5    >>>>>>> Active Errors
6       TOE-PE-0:0:0      0
7       TOE-PE-0:0:1      0
8       TOE-PE-0:0:2      0
9       PE Chip           3    >>>>>>> Active Errors
10      TOE-PE-1:0:0      0
11      TOE-PE-1:0:1      0
12      TOE-PE-1:0:2      0
13      TOE-PE-2:0:0      0
14      TOE-PE-2:0:1      0
15      TOE-PE-2:0:2      0
16      TOE-PE-3:0:0      0
17      TOE-PE-3:0:1      0
18      TOE-PE-3:0:2      0
19      TOE-PE-4:0:0      0
20      TOE-PE-4:0:1      0
21      TOE-PE-4:0:2      0
22      TOE-PE-5:0:0      0
23      TOE-PE-5:0:1      0
24      TOE-PE-5:0:2      0
25      BCM Switch        0
26      FPC               0
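
For a capture of this table saved to a file, the modules with active errors can be picked out automatically. The sketch below is an illustration only: it assumes the three-column layout shown above (module number, name, active-error count, optionally followed by an annotation starting with ">"), and the function name is hypothetical.

```python
import re

# One table row: module number, module name, active-error count, and an
# optional trailing annotation such as ">>>> Active Errors".
ROW_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d+)\s*(?:>+.*)?$")


def active_error_modules(output):
    """Return {module name: active error count} for rows with a nonzero count."""
    result = {}
    for line in output.splitlines():
        match = ROW_RE.match(line)
        if match and int(match.group(3)) > 0:
            result[match.group(2)] = int(match.group(3))
    return result


# Excerpt of the output shown in this article.
sample = """\
---------------------------------------
Module  Name              Active Errors
---------------------------------------
1       PCIe Error        0
5       Generic HMC       5    >>>>>>> Active Errors
6       TOE-PE-0:0:0      0
9       PE Chip           3    >>>>>>> Active Errors
26      FPC               0
"""
errors = active_error_modules(sample)
print(errors)
```

An empty result after the FPC restart indicates that no module reports active errors.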

 

Cause:

These are fatal errors on the HMCIF link between the HMC and the PFE. Because these errors can affect traffic, they are classified as fatal, and the PFE that reports them is marked as disabled (all ports belonging to this PFE go down).

 

Solution:

To clear the alarm, restart the FPC via the Command Line Interface (CLI) by using the following syntax:

request chassis fpc (offline | online | restart) slot slot-number
 

After restarting the FPC, make sure that the PFE is enabled:

<FPC#>(PTX10008-re0 vty)# show pechip <PFE ID>
PE[#] : ASIC Name: PE#, ASIC ID: #
Version 2.0  JTAG ID: XXXXXX
Initialized: Yes  Fault/Disabled: No    >>>> PFE is not disabled (enabled)

<router name>(PTX10008-re0 vty)# show cmerror module brief
---------------------------------------
Module  Name              Active Errors
---------------------------------------
1       PCIe Error        0
2       CPU Error         0
3       Eth Port Error    0
4       Host Loopback     0
5       Generic HMC       0    >>>>> No Active Errors
6       TOE-PE-0:0:0      0
7       TOE-PE-0:0:1      0
8       TOE-PE-0:0:2      0
9       PE Chip           0   >>>>>> No Active Errors
10      TOE-PE-1:0:0      0
11      TOE-PE-1:0:1      0
12      TOE-PE-1:0:2      0
13      TOE-PE-2:0:0      0
14      TOE-PE-2:0:1      0
15      TOE-PE-2:0:2      0
16      TOE-PE-3:0:0      0
17      TOE-PE-3:0:1      0
18      TOE-PE-3:0:2      0
19      TOE-PE-4:0:0      0
20      TOE-PE-4:0:1      0
21      TOE-PE-4:0:2      0
22      TOE-PE-5:0:0      0
23      TOE-PE-5:0:1      0
24      TOE-PE-5:0:2      0
25      BCM Switch        0
26      FPC               0
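
The post-restart check of the "show pechip" output can also be scripted. This is a minimal sketch, assuming the "Initialized: ... Fault/Disabled: ..." line keeps the form shown in this article; the function name is hypothetical.

```python
import re

# The status line from "show pechip", as shown in this article.
STATE_RE = re.compile(r"Initialized:\s*(Yes|No)\s+Fault/Disabled:\s*(Yes|No)")


def pfe_is_healthy(pechip_output):
    """Return True if the PE chip is initialized and not marked Fault/Disabled."""
    match = STATE_RE.search(pechip_output)
    if match is None:
        raise ValueError("no 'Initialized: ... Fault/Disabled: ...' line found")
    initialized, disabled = match.groups()
    return initialized == "Yes" and disabled == "No"


before = "Initialized: Yes Fault/Disabled: Yes   >>>> PFE is disabled (Marked down)"
after = "Initialized: Yes  Fault/Disabled: No"
print(pfe_is_healthy(before), pfe_is_healthy(after))  # prints: False True
```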
 

If the issue is not resolved, open a Technical Service Request with the following logs:

Collect the following information from CLI:

Collect the following information from VTY:

<FPC#>(PTX10008-re0 vty)# show pechip <PFE ID>
<FPC#>(PTX10008-re0 vty)# show cmerror module brief

A script (event policy) can be set up on the device so that the FPC is restarted as soon as the HMC error appears. By default, Junos OS software resiliency performs the "disable-pfe" action and shuts down all WAN interfaces; however, the system continues to install active routes and their next hops into the faulty HMC memory. Junos OS 18.1/18.2 within CM2.0 does allow action based on error-id. Today, all alarm actions are based on severity and apply to all fatal events; the default action for fatal events on PTX is disable-pfe. The way to recover from this error is to restart the FPC. The following policies restart the FPC in slot 0 or slot 1 on any HMC fatal error:

set event-options policy restart-fpc-on-hmc-error-slot0 events PIC
set event-options policy restart-fpc-on-hmc-error-slot0 within 3 trigger on
set event-options policy restart-fpc-on-hmc-error-slot0 within 3 trigger 1
set event-options policy restart-fpc-on-hmc-error-slot0 attributes-match PIC.message matches "fpc0 .*PE Chip::FATAL ERROR!! from .* HMCIF"
set event-options policy restart-fpc-on-hmc-error-slot0 then execute-commands commands "request chassis fpc slot 0 restart"
 
set event-options policy restart-fpc-on-hmc-error-slot1 events PIC
set event-options policy restart-fpc-on-hmc-error-slot1 within 3 trigger on
set event-options policy restart-fpc-on-hmc-error-slot1 within 3 trigger 1
set event-options policy restart-fpc-on-hmc-error-slot1 attributes-match PIC.message matches "fpc1 .*PE Chip::FATAL ERROR!! from .* HMCIF"
set event-options policy restart-fpc-on-hmc-error-slot1 then execute-commands commands "request chassis fpc slot 1 restart"
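
On a chassis with more FPC slots, one policy per slot is needed. The following Python sketch generates the same five statements for any slot number, following the slot 0 policy above verbatim; the function name is hypothetical, and the slot list is an assumption to be adjusted to the populated slots.

```python
def hmc_restart_policy(slot):
    """Return the set commands for one slot, modeled on the slot 0 policy above."""
    base = f"set event-options policy restart-fpc-on-hmc-error-slot{slot}"
    pattern = f"fpc{slot} .*PE Chip::FATAL ERROR!! from .* HMCIF"
    return [
        f"{base} events PIC",
        f"{base} within 3 trigger on",
        f"{base} within 3 trigger 1",
        f'{base} attributes-match PIC.message matches "{pattern}"',
        f'{base} then execute-commands commands '
        f'"request chassis fpc slot {slot} restart"',
    ]


# Assumed slot list for illustration only.
for slot in (0, 1, 2):
    print("\n".join(hmc_restart_policy(slot)))
    print()
```

The generated lines can be reviewed and then pasted in configuration mode; nothing here validates them against the device.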


In the following Junos OS releases:

  • 16.1X65-D46
  • 17.2R1-S5
  • 17.2R3
  • 17.2X75-D70
  • 17.3R2
  • 17.4R1
  • 18.1R1
  • 18.1X75-D10

Junos OS software resiliency has been enhanced to skip route/next-hop programming and queue statistics collection upon the "disable-pfe" action, which prevents the symptoms described above. This does not prevent the HMC fatal error; it only reduces the operational impact of such a failure.

The enhancement ensures that disabling one PFE (1 out of 6, a loss of about 16.7% of forwarding capacity) does not affect TOE. A chassis-wide alarm is raised, and the customer can schedule the reboot of the node in a maintenance window.
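
The capacity figure follows from simple arithmetic: one PFE out of six is one sixth of the forwarding capacity. The short sketch below reproduces it and extends it to more than one disabled PFE, assuming six equally sized PFEs per FPC as stated above; the function name is hypothetical.

```python
def capacity_lost_percent(disabled_pfes, total_pfes=6):
    """Percentage of forwarding capacity lost, assuming equally sized PFEs."""
    if not 0 <= disabled_pfes <= total_pfes:
        raise ValueError("disabled_pfes must be between 0 and total_pfes")
    return 100.0 * disabled_pfes / total_pfes


print(f"{capacity_lost_percent(1):.1f}%")  # prints: 16.7%
```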

Modification History:

2019-12-28: Article reviewed for accuracy. Minor changes made. Article is correct and complete.
2020-06-12: Added script to bottom of the Solution.
