This article explains why the system may show high CPU utilization on the RE-DUO-2600 Routing Engine (RE) in PTX3000 and PTX5000 routers because of the interrupt thread irq17: uhci1 uhci4*, and describes the corrective actions that resolve the issue.
CPU utilization remains high due to an interrupt storm (irq17) coming from the RE/CB hardware.
User@PTX3000-re0> show chassis hardware
Hardware inventory:
Item Version Part number Serial number Description
Chassis JN1258164AJC PTX3000
Midplane REV 25 750-044645 ACMJ2769 Backplane
FPM REV 07 760-044663 ACNF3954 Front Panel Display
PSM 0 REV 04 740-044980 1EDJ5310794 DC 12V Power Supply
PSM 1 REV 04 740-044980 1EDJ5310779 DC 12V Power Supply
PSM 2 REV 04 740-044980 1EDJ5310789 DC 12V Power Supply
PSM 3 REV 04 740-044980 1EDJ5310815 DC 12V Power Supply
Routing Engine 0 REV 12 740-026942 P737A-006537 RE-DUO-2600
ad0 3807 MB SMART CF SPG2014121202330 Compact Flash
ad1 59488 MB VSFA18PI064G-EM 35187-203 Disk 1
Routing Engine 1 REV 12 740-026942 P737A-005826 RE-DUO-2600
ad0 3807 MB SMART CF SPG2014081302009 Compact Flash
ad1 57241 MB SGC13T064-TS9KBC-EM SO141015AS1569841 Disk 1
CB 0 REV 17 750-044656 ACDS7924 Control Board
CB 1 REV 17 750-044656 ACDR9985 Control Board
User@PTX3000-re0> show system processes extensive
last pid: 27215; load averages: 1.17, 0.86, 0.81 up 916+23:25:36 11:05:06
166 processes: 3 running, 141 sleeping, 1 zombie, 21 waiting
Mem: 787M Active, 96M Inact, 356M Wired, 499M Cache, 214M Buf, 14G Free
Swap: 3327M Total, 3327M Free
PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND
23 root 1 -84 -187 0K 16K RUN 108.0H 76.71% irq17: uhci1 uhci4*
10 root 1 155 52 0K 16K RUN ??? 5.57% idle
2116 root 2 -26 -26 149M 113M nanslp 1334.5 5.57% chassisd
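The process listing above can also be checked offline when the output has been saved to a file. The following is a minimal sketch in Python, not a Juniper tool: the function name `busy_irq_threads` and the 50 percent threshold are assumptions chosen for illustration, and the parser relies only on the WCPU and COMMAND columns appearing at the end of each line as they do in this output.

```python
import re

# Matches the WCPU value followed by the COMMAND column,
# for example "76.71% irq17: uhci1 uhci4*".
WCPU_RE = re.compile(r"(\d+(?:\.\d+)?)%\s+(.+)$")


def busy_irq_threads(output, threshold=50.0):
    """Return (command, wcpu) pairs for irq threads at or above threshold.

    The threshold is a hypothetical value for illustration, not a
    vendor-recommended limit.
    """
    hits = []
    for line in output.splitlines():
        m = WCPU_RE.search(line.strip())
        if not m:
            continue
        wcpu = float(m.group(1))
        command = m.group(2).strip()
        if command.startswith("irq") and wcpu >= threshold:
            hits.append((command, wcpu))
    return hits


sample = """\
PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND
23 root 1 -84 -187 0K 16K RUN 108.0H 76.71% irq17: uhci1 uhci4*
10 root 1 155 52 0K 16K RUN ??? 5.57% idle
2116 root 2 -26 -26 149M 113M nanslp 1334.5 5.57% chassisd
"""

for command, wcpu in busy_irq_threads(sample):
    print(f"{command}: {wcpu}% WCPU")
```

Only interrupt threads are reported, so ordinary daemons such as chassisd are ignored even when they are busy.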
User@PTX3000-re0> show chassis routing-engine
Routing Engine status:
Slot 0:
Current state Master
Election priority Master
Temperature 34 degrees C / 93 degrees F
CPU temperature 56 degrees C / 132 degrees F
DRAM 16359 MB (16384 MB installed)
Memory utilization 11 percent
5 sec CPU utilization:
User 6 percent
Background 0 percent
Kernel 13 percent
Interrupt 78 percent <--------------------- High
Idle 3 percent
1 min CPU utilization:
User 2 percent
Background 0 percent
Kernel 9 percent
Interrupt 77 percent <--------------------- High
Idle 12 percent
5 min CPU utilization:
User 2 percent
Background 0 percent
Kernel 8 percent
Interrupt 77 percent <--------------------- High
Idle 13 percent
15 min CPU utilization:
User 1 percent
Background 0 percent
Kernel 8 percent
Interrupt 77 percent <--------------------- High
Idle 14 percent
Model RE-DUO-2600
Serial ID P737A-006537
Start time 2017-03-06 11:40:00 JST
Uptime 916 days, 23 hours, 25 minutes, 12 seconds
Last reboot reason Router rebooted after a normal shutdown.
Load averages: 1 minute 5 minute 15 minute
1.24 0.88 0.82
Routing Engine status:
Slot 1:
Current state Backup
Election priority Backup
Temperature 42 degrees C / 107 degrees F
CPU temperature 63 degrees C / 145 degrees F
DRAM 16359 MB (16384 MB installed)
Memory utilization 10 percent
5 sec CPU utilization:
User 0 percent
Background 0 percent
Kernel 0 percent
Interrupt 0 percent
Idle 100 percent
Model RE-DUO-2600
Serial ID P737A-005826
Start time 2017-03-06 11:18:04 JST
Uptime 916 days, 23 hours, 46 minutes, 50 seconds
Last reboot reason Router rebooted after a normal shutdown.
Load averages: 1 minute 5 minute 15 minute
0.00 0.00 0.00
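To compare the Interrupt figures across both REs and all averaging windows without reading the output by eye, the saved text can be parsed. This is a sketch under stated assumptions: the function names `interrupt_utilization` and `storming_slots` and the 50 percent threshold are hypothetical, and the parser expects the line layout shown in the output above.

```python
import re


def interrupt_utilization(output):
    """Map slot number -> {averaging window: Interrupt percent}."""
    result = {}
    slot = None
    window = None
    for raw in output.splitlines():
        line = raw.strip()
        m = re.match(r"Slot (\d+):", line)
        if m:
            slot = int(m.group(1))
            result[slot] = {}
            window = None  # a new slot starts with no window selected
            continue
        m = re.match(r"(.+?) CPU utilization:", line)
        if m:
            window = m.group(1)  # e.g. "5 sec", "1 min"
            continue
        m = re.match(r"Interrupt\s+(\d+) percent", line)
        if m and slot is not None and window is not None:
            result[slot][window] = int(m.group(1))
    return result


def storming_slots(utilization, threshold=50):
    """Return slots where any window is at or above the assumed threshold."""
    return sorted(
        slot for slot, windows in utilization.items()
        if any(pct >= threshold for pct in windows.values())
    )


sample = """\
Routing Engine status:
Slot 0:
CPU temperature 56 degrees C / 132 degrees F
5 sec CPU utilization:
User 6 percent
Interrupt 78 percent <--------------------- High
Idle 3 percent
1 min CPU utilization:
Interrupt 77 percent <--------------------- High
Routing Engine status:
Slot 1:
5 sec CPU utilization:
Interrupt 0 percent
Idle 100 percent
"""

util = interrupt_utilization(sample)
print(util)
print("Slots with high interrupt load:", storming_slots(util))
```

A slot that stays high in every window, as Slot 0 does here, points to a sustained condition rather than a short spike.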
In this case, the high CPU utilization is caused by an interrupt storm from a failing RE/CB, which overloads the RE CPU.
The show system boot-messages command output shows that irq 17 is shared by multiple devices (PCI bridges, USB controllers, the SATA controller, and the SMBus controller) that are connected via the PCI bus on the RE. It is therefore hard to identify which device is actually generating the frequent interrupts.
$ grep "irq 17" RSI.log
pcib7: <MPTable PCI-PCI bridge> mem 0xdec00000-0xdec0ffff irq 17 at device 14.0 on pci8
pcib11: <PCI-PCI bridge> irq 17 at device 2.0 on pci11
pcib14: <PCI-PCI bridge> irq 17 at device 6.0 on pci11
pcib17: <PCI-PCI bridge> irq 17 at device 10.0 on pci11
pcib19: <PCI-PCI bridge> irq 17 at device 14.0 on pci11
uhci1: <UHCI (generic) USB controller> port 0x1840-0x185f irq 17 at device 26.1 on pci0
uhci4: <UHCI (generic) USB controller> port 0x18a0-0x18bf irq 17 at device 29.1 on pci0
atapci0: <Intel ICH9 SATA300 controller> port 0x1c50-0x1c57,0x1c44-0x1c47,0x1c48-0x1c4f,0x1c40-0x1c43,0x18e0-0x18ff mem 0xdeb01000-0xdeb017ff irq 17 at device 31.2 on pci0
ichsmb0: <Intel 82801I (ICH9) SMBus controller> port 0x1c00-0x1c1f mem 0xdeb02000-0xdeb020ff irq 17 at device 31.3 on pci0
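The grep above lists the devices for one IRQ. The same boot messages can be grouped by IRQ number to show how many devices share each line. This is a minimal Python sketch: the function name `devices_by_irq` is hypothetical, and the pattern assumes the probe-line format seen in the output above (device name, description in angle brackets, then "irq N").

```python
import re
from collections import defaultdict

# Probe lines look like: "<name>: <description> ... irq <n> at device ..."
DEV_RE = re.compile(r"^(\w+): <([^>]+)>.*\birq (\d+)\b")


def devices_by_irq(boot_messages):
    """Group (device name, description) pairs by the IRQ they report."""
    shared = defaultdict(list)
    for line in boot_messages.splitlines():
        m = DEV_RE.match(line.strip())
        if m:
            name, description, irq = m.groups()
            shared[int(irq)].append((name, description))
    return dict(shared)


sample = """\
pcib11: <PCI-PCI bridge> irq 17 at device 2.0 on pci11
uhci1: <UHCI (generic) USB controller> port 0x1840-0x185f irq 17 at device 26.1 on pci0
ichsmb0: <Intel 82801I (ICH9) SMBus controller> port 0x1c00-0x1c1f mem 0xdeb02000-0xdeb020ff irq 17 at device 31.3 on pci0
"""

for irq, devices in devices_by_irq(sample).items():
    print(f"irq {irq} is shared by {len(devices)} devices:")
    for name, description in devices:
        print(f"  {name}: {description}")
```

The grouping only shows which devices share the interrupt line. It cannot tell which of them is generating the storm.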
However, the root cause of the irq17 interrupt storm is not currently known, and it is still unclear whether the issue is related to hardware or software.
If this issue is seen only on one specific RE/CB pair and cannot be replicated with other RE/CB pairs, it is most likely due to a defective RE (or CB), which needs to be RMAed. The RE-DUO-2600 Routing Engine mentioned in this article is sourced from a third-party (OEM) vendor, and it has been marked EOL/EOS since late 2008.
The recommended course of action is to restart the RE immediately, which should bring CPU usage back down. If the high CPU condition persists after the restart, suspect a hardware issue on the RE or CB. The defective unit (RE or CB) should then be identified and RMAed.
Note: The high CPU issue can be seen on both the Primary and Backup REs. If the problem is seen on the Primary RE, the NOC should restart the RE immediately to avoid side effects such as protocol flaps, which can be caused by RE CPU resource exhaustion.