[CSO] Operational status of site DOWN in UI but CPE state physically UP


Article ID: KB36869 | KB Last Updated: 05 May 2021 | Version: 1.0

Summary:

This article explains why the Contrail Service Orchestration (CSO) UI can raise an alarm showing a customer premises equipment (CPE) device as DOWN while an administrator who logs in to the CPE finds it UP.

Cause:

At the time of writing, CSO 5.1.2 and earlier manage two types of CPE: NFX and SRX. NFX devices are monitored by a device agent, whereas SRX devices are monitored by a cloud agent. Refer to KB36422 - [CSO] How are CPE devices monitored? for details.

For NFX, the device agent collects the device state and sends it to CSO in a periodic probe, once every 60 seconds. The agent also gathers the IPsec tunnel status and sends it in the same probe. If a site has a large number of tunnels, gathering the tunnel status takes longer and the probe reaches CSO late.

On the CSO side, the Icinga monitoring system checks for a probe event every 90 seconds. If no event is found when the check runs, the operational status of the CPE is marked as DOWN.
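The interaction between the two timers can be sketched with a simple model. The sketch below assumes that the agent sleeps for whatever remains of its 60-second period after a collection finishes, which is what the agent log message quoted later in this article suggests. The function name and the sample collection times are illustrative and are not part of CSO.

```python
MIN_REPORT_PERIOD_S = 60.0  # probe period stated in this article


def cycle_timing(collection_s, min_period_s=MIN_REPORT_PERIOD_S):
    """Return (sleep_s, period_s) for one agent cycle under the assumed model."""
    # The agent only sleeps if collection finished before the period elapsed.
    sleep_s = max(0.0, min_period_s - collection_s)
    # Time between two consecutive probes: collection time plus any sleep.
    period_s = collection_s + sleep_s
    return sleep_s, period_s


# Illustrative collection times in seconds (assumed values, not measurements).
for collection_s in (35.0, 60.0, 80.0):
    sleep_s, period_s = cycle_timing(collection_s)
    print(f"collection={collection_s:.0f}s sleep={sleep_s:.0f}s period={period_s:.0f}s")
```

Under this model, once collection takes longer than 60 seconds the agent no longer sleeps, and the gap between probes grows with the collection time, leaving less margin before the 90-second Icinga check.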

How can this be confirmed on the NFX?

Log in to the JDM shell, navigate to the /var/log/telemetry folder, and use the cat command to view the agent.log file. Look for entries such as the following:

2021-04-22 09:29:08 UTC | INFO | CSO-telemetry-agent | MainThread | MainProcess | telemetry-agent(daemon.py:143) | Collection took 34.9759478569 which is less than minimum reporting frequency 60,sleeping for 25.0240521431

This entry shows how long the agent took to gather the status (about 35 seconds in this example). If the collection time is above 60 seconds, the probes are being delayed and the CPE may be marked as DOWN in CSO.
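For a long agent.log, the check can be automated. The following is a minimal sketch that assumes every relevant entry contains the text "Collection took" followed by a number of seconds, as in the sample entry above; the wording of entries for slow collections is not shown in this article and may differ. The function name and the regular expression are illustrative.

```python
import re

# Matches the "Collection took <seconds>" part of the sample log entry.
COLLECTION_RE = re.compile(r"Collection took (?P<seconds>\d+(?:\.\d+)?)")

REPORT_INTERVAL_S = 60.0  # reporting period stated in this article

# Sample entry copied from this article.
SAMPLE_LINE = (
    "2021-04-22 09:29:08 UTC | INFO | CSO-telemetry-agent | MainThread | "
    "MainProcess | telemetry-agent(daemon.py:143) | Collection took "
    "34.9759478569 which is less than minimum reporting frequency 60,"
    "sleeping for 25.0240521431"
)


def slow_collections(lines, threshold=REPORT_INTERVAL_S):
    """Return (timestamp, seconds) for each collection longer than threshold."""
    slow = []
    for line in lines:
        match = COLLECTION_RE.search(line)
        if match is None:
            continue  # not a collection-time entry
        seconds = float(match.group("seconds"))
        if seconds > threshold:
            # The timestamp is the first " | "-separated field of the entry.
            timestamp = line.split(" | ", 1)[0].strip()
            slow.append((timestamp, seconds))
    return slow


print(slow_collections([SAMPLE_LINE]))  # prints []
```

To scan the real file, pass an open file object, for example `slow_collections(open("/var/log/telemetry/agent.log"))`. An empty result means no collection exceeded 60 seconds among the entries that matched the assumed pattern.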

Solution:

CSO versions 5.4 and 6.0 have enhanced capability to track this behavior.

One possible way to avoid this issue is to increase the check time for probes in the Icinga component of CSO. Increasing the check time to, say, 5 minutes (300 seconds) could keep the CSO UI from reporting these false alarms.

To change the check time for probes, perform the following steps:

  1. In the /etc/icinga2/conf.d/templates.conf file, change check_interval from 90 to 300 in the UCPE_DEVICE-host template (shown here with its original value):

template Host "UCPE_DEVICE-host" {
    max_check_attempts = 3
    check_interval = 90
    retry_interval = 65
    enable_notifications = 1
    enable_event_handler = 1
    enable_flapping = 0
    enable_perfdata = 0
    check_command = "check_reachability"
    enable_active_checks = 1
    enable_passive_checks = 1
}

  2. Restart Icinga.

systemctl restart icinga2

Note: The disadvantage of this change is that a true alarm will go undetected for up to the duration of the new check_interval.
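If the edit in step 1 has to be repeated or scripted, it could be done along the following lines. This is a hedged sketch, not a Juniper-provided tool: the function name is hypothetical, and it assumes the template body contains no nested braces and exactly one check_interval line, as in the listing above. It works on the file contents as a string and leaves reading and writing the file to the caller.

```python
import re


def set_check_interval(conf_text, new_value, template="UCPE_DEVICE-host"):
    """Return conf_text with check_interval changed inside one Host template."""
    # Capture the template header, its body up to the first "}", and the "}".
    block_re = re.compile(
        r'(template\s+Host\s+"%s"\s*\{)(.*?)(\})' % re.escape(template),
        re.DOTALL,
    )
    # Match only a line that starts with check_interval (not retry_interval).
    line_re = re.compile(r"^(\s*check_interval\s*=\s*)\S+", re.MULTILINE)

    def patch_block(match):
        body, count = line_re.subn(
            lambda m: m.group(1) + str(new_value), match.group(2)
        )
        if count != 1:
            raise ValueError("expected exactly one check_interval line")
        return match.group(1) + body + match.group(3)

    patched, blocks = block_re.subn(patch_block, conf_text, count=1)
    if blocks == 0:
        raise ValueError("template %r not found" % template)
    return patched


sample = 'template Host "UCPE_DEVICE-host" {\ncheck_interval = 90\nretry_interval = 65\n}\n'
print(set_check_interval(sample, 300))
```

Back up templates.conf before writing the result back, and restart Icinga as in step 2 for the change to take effect.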
