
[MX/PTX/SRX] Minor Alarm: "Potential slow peers are: x"


Article ID: KB36339 | KB Last Updated: 15 Dec 2020 | Version: 1.0
Summary:

This article explains why the minor alarm "Potential slow peers are: x" is raised against the system/Routing Engine. It describes how to identify the cause of this behavior and what initial data to collect for analysis by Support in case the issue is not resolved via self-help.

 

Symptoms:

Error Log

Jul  8 15:04:52.856  alarmd[15278]: %DAEMON-4: Alarm set: RE color=YELLOW, class=CHASSIS, reason=Potential slow peers are: dfwd
Jul  8 15:04:52.856  craftd[13594]: %DAEMON-4: Minor alarm set , Potential slow peers are: dfwd

Alarm

user@host> show chassis alarms no-forwarding

1 alarms currently active
Alarm time               Class  Description
2020-07-08 15:04:52 CEST Minor  Potential slow peers are: dfwd

 

Cause:

The "Potential slow peers are: X" messages are reporting/debugging messages about flow control of the KRT, which can happen if a big burst of RPD updates take place due to network events (or flapping of links to next-hops, IGP/BGP convergence, high CPU, dump writing and so on).

 

Solution:

Because the alarm is of minor severity, it should ideally have little impact at the time it is triggered. However, if the condition behind the alarm affects the overall CPU of the device, some impact will be seen.

In certain cases, the alarm may clear on its own with no impact observed on the device while it is active. Even in these cases, the logs should be checked to find the root cause.

The following sections describe common causes and how to troubleshoot the alarm.

First, check the CPU utilization of the device/process that is responsible for the alarm:

user@host> show system processes extensive
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND

14962 root   103 0    859M 135M RUN  0  82.7H 100.00% dfwd      --> dfwd is stuck at 100%

As seen above, the process for which the alarm was generated is running at 100% CPU utilization. In such cases, the process must be restarted to restore its normal operation.

Caution: The process must be restarted during a maintenance window. If you are unsure of the impact of restarting a process, contact Support before proceeding with this step.

Steps to Restart

user@host> start shell user root
% kill -9 14962       ---> 14962 indicates the PID as obtained from the system processes output.

After the above step is performed, confirm that the process has respawned by running the show system processes extensive | match dfwd command. Contact Support if you observe any discrepancies.
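
After the restart, the process should return with a new PID and normal CPU utilization, and the minor alarm should then clear. The PID and values shown below are only illustrative:

user@host> show system processes extensive | match dfwd
23451 root    20   0    210M  58M select 0   0:01   0.00% dfwd

user@host> show chassis alarms no-forwarding
No alarms currently active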

Sometimes, even when processes are running normally as per the system processes output, the alarm may occur due to high RE utilization. Check RE utilization to confirm:

user@host> show chassis routing-engine no-forwarding
<snip>
15 min CPU utilization:
User                       2 percent
Background                 0 percent
Kernel                    75 percent         ----> High Kernel Utilization

Use the following command to identify the reason for the kernel spike:

> show log messages

In the log messages, you may see JLOCK hog messages being written continuously, which could be the reason for the kernel spike. This is one of the common causes of increased kernel utilization.
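
As a quick check, the messages log can be filtered for these events directly from the CLI (the match string below assumes the hog reports contain the text "JLOCK hog"):

user@host> show log messages | match "JLOCK hog"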

A JLOCK hog may or may not be accompanied by service impact, depending on which processes are reporting it and why (the root cause). Such a log message means that the specified thread held JLOCK for longer than the allowed time, which is why it is reported. JLOCK hogs can occur for various reasons and have different outcomes. For example, on a heavily stressed box, these messages may be seen because of network instability that is active at the time of the issue. The maximum time that a thread can hold JLOCK is limited in order to avoid impact on other threads/processes and the system in general.

If you see JLOCK hogs on a stable device (without noticeable impact), there may be a software issue. Otherwise, the usual conclusion is that the kernel was very busy because of network instability. However, it may be hard to present a concrete list of facts in these circumstances without a coredump from the moment in question.
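
It is also worth checking whether any core files were generated around the time of the event; an empty listing simply means that no core was written:

user@host> show system core-dumps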

If such occurrences of high kernel utilization are seen, collect the following command output for JTAC analysis:

> show krt queue
> show krt state

> start shell
% rtsockmon -t > /var/tmp/rtsockmon.txt
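
Note: rtsockmon streams routing socket activity continuously, so it typically has to be left running while the issue is occurring and then stopped (for example, with Ctrl+C) before /var/tmp/rtsockmon.txt is collected.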

If the logs are not clear enough to determine the cause, reach out to Support for analysis. Collect the following output before opening a case to speed up the analysis:

> show log messages
> request support information | no-more
> show krt queue
> show krt state
> start shell
% rtsockmon -t > /var/tmp/rtsockmon.txt 
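
Where convenient, the CLI outputs can also be written straight to files on the Routing Engine with the "save" pipe so that they can be attached to the case; the file names below are only examples:

> request support information | save /var/tmp/rsi.txt
> show log messages | save /var/tmp/messages.txt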

 
