
[Subscriber-Management] MX Virtual Chassis Best Practices


Article ID: KB32740 KB Last Updated: 03 Apr 2021 Version: 3.0
Summary:

This article provides guidelines, known behaviors, and best practices related to the use of the MX Virtual Chassis (MX-VC).

Solution:

The following recommendations pertain to MX-VC deployments:

  • Routing Engine: Routing Engines (REs) that are installed in two systems on the same platform (MX and MX, as opposed to MX and MX2020) should be of the same type and have the same amount of memory. REs with 32 GB of memory have a higher scaling capacity and use 64-bit daemons such as the Subscriber Management Daemon (SMGD). If member0 has 32 GB REs and member1 has 16 GB REs of the same type, SMGD will not be able to sync, because SMGD runs in 32-bit mode on the 16 GB member and in 64-bit mode on the 32 GB member.

  • FPC/MPC Support: It is possible that newer Flexible PIC Concentrators (FPCs) and Modular Port Concentrators (MPCs) may not be supported in the release that is being used. To be sure that the hardware is compatible with the code, refer to Protocols and Applications Supported on MPCs for MX Series Routers.

  • Virtual-Chassis VCP location: Virtual-Chassis Control Port (VCP) links should have their own dedicated FPC. This is to avoid control packet congestion or contention on the FPC that is hosting the VC control channel.

    • Recommended VCP links: In Junos OS releases prior to 17.2, it is recommended that the number of VCP links be a power of 2 (2, 4, 8, or 16). The hash algorithm that is used to send traffic over the links between members is limited to those values. If any other number is used, such as an odd number, traffic will not be spread equally. In Junos OS Release 17.2 and later, RLI31252 removes this requirement because the algorithm has changed. With these later releases, an odd number of links such as 3, 5, 7, or 9 can be used and the traffic is still distributed as expected.

  • DDoS for VCP links: There are two queues in use on VCP links between members: VC-Chassis-Control-High and VC-Chassis-Control-Low. The high queue is used for VCP keep-alive packets and internal communications, whereas the low queue is used for all control traffic that needs to traverse the links (DHCP, PPPoE, L2TP, and so on).
  • SCFD: Suspicious Control Flow Detection - When enabled globally, it is required that flow-detection be disabled for all the Virtual-Chassis categories.  Refer to the following link for further information and categories: DDOS-Protection-Global-Flow-Detection-Mode-Configuring.
  • Virtual Chassis Heartbeat: It is recommended that a heartbeat be configured to help prevent split-brain (primary-primary) scenarios. A heartbeat is an external hello mechanism that uses the Transmission Control Protocol (TCP) between the two members via the fxp0 management interface on the primary Routing Engine of each member. When the actual VCP links are down, the heartbeat TCP connection stays up, so neither side wrongly concludes that the other is gone, which would otherwise result in dual primaries. With heartbeats in place, the backup member goes into isolation mode and brings down all the FPCs in its chassis until the VCP links are restored and an adjacency is formed. Additional details on the heartbeat functionality can be found at Configuring a Virtual Chassis Heartbeat Connection.
  • Adding or Removing VCP links: When a new VCP link is added, the ge/xe/et interface is converted to a VCP port. Wait until this process is complete for a single link before adding more links. Do not add two links at the same time, because the last link processed can end up in a permanently down state. If a link does get into this condition, the recovery procedure is to remove the VCP link and add it again. When adding or removing multiple VCP links, wait approximately 10 seconds before adding or removing the next link.

  • Switchover Behavior: Starting with Junos OS Release 15.1, when a Virtual Chassis Graceful Routing Engine Switchover (GRES) event occurs between two Routing Engines, the former Primary RE reloads automatically before transitioning to Standby. This reload is intentional: it clears state and avoids complicated Primary-to-Standby transition logic issues.

  • GRES Readiness: To confirm whether the VC system is ready for a Virtual Chassis GRES switchover, run the 'request virtual-chassis routing-engine master switch check' command. If the system is not ready, the command displays the reason.

  • Virtual-Chassis reload: To reload all four REs in a Virtual Chassis, use the 'request system reboot all-member both-routing-engines' command. This reloads both REs in both member systems.

  • Unified In-Service Software Upgrade (ISSU) in Virtual-Chassis: Before starting an ISSU, be aware of the Link Aggregation Control Protocol (LACP) and Bidirectional Forwarding Detection (BFD) behavior changes that occur during the upgrade. For LACP, the system automatically changes from fast intervals to slow during the ISSU, so the connecting device must be changed to slow before the ISSU is initiated. BFD timers also increase automatically, but BFD updates the peering devices itself and no user intervention is needed. See Preparing for a Unified ISSU in an MX Series Virtual Chassis for steps to prepare a VC for an ISSU.

  • Routing Engine Replacement: If an RE fails and needs to be replaced, note that a replacement (RMA) RE ships from the factory with a default configuration in which VC is disabled. For the new RE to come up in a Virtual Chassis, you must follow the instructions for replacing an RE: https://www.juniper.net/documentation/en_US/junos/topics/example/virtual-chassis-mx-series-replacing-routing-engine.html

  • Switch Control Board (SCB) Upgrade: Virtual Chassis Member ID information is stored on the SCBs that are installed in the system. When upgrading all SCBs, be aware that Member ID information may be lost and may need to be defined again. When upgrading SCB modules, see Upgrading an MX Virtual Chassis SCB or SCBE to SCBE2 for instructions.
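
As a rough illustration of two of the recommendations above (matching RE type and memory across members, and the VCP link count in releases prior to 17.2), the following Python sketch flags a planned deployment that departs from them. It is a planning aid under stated assumptions, not a Junos tool: `MemberPlan`, `preflight`, and `parse_release` are hypothetical names introduced here, and the inputs are values the operator records by hand.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MemberPlan:
    """Hypothetical record of one MX-VC member (not a Junos data structure)."""
    re_model: str       # Routing Engine type, as recorded by the operator
    re_memory_gb: int   # installed RE memory in GB


def is_power_of_two(n: int) -> bool:
    # True for 1, 2, 4, 8, 16, ...
    return n > 0 and (n & (n - 1)) == 0


def parse_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as '17.2R1' or '15.1'."""
    major, rest = release.split(".", 1)
    minor_digits = ""
    for ch in rest:
        if not ch.isdigit():
            break
        minor_digits += ch
    return int(major), int(minor_digits)


def preflight(member0: MemberPlan, member1: MemberPlan,
              vcp_links: int, release: str) -> List[str]:
    """Return warnings for departures from the recommendations above."""
    warnings: List[str] = []
    if (member0.re_model != member1.re_model
            or member0.re_memory_gb != member1.re_memory_gb):
        warnings.append("REs differ in type or memory between members")
    # The power-of-2 link count only matters before Release 17.2.
    if parse_release(release) < (17, 2) and not is_power_of_two(vcp_links):
        warnings.append("VCP link count is not a power of 2")
    return warnings
```

For example, `preflight(MemberPlan("RE-A", 32), MemberPlan("RE-A", 16), 3, "16.1R4")` returns both warnings, whereas the same three links with matched REs on "17.2R1" return none. The model string "RE-A" is a placeholder, not a real part number.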

Preventative Issues

  • Customers with high-scale environments running Enhanced Subscriber Management, with or without Virtual Chassis, typically need to make changes to their systems on a regular basis. In many Subscriber Management systems, some aspect of dynamic interface creation is used to terminate subscribers. Interfaces such as et/xe/ge/si/psx/lt may have any combination of subscribers such as DHCPv4/v6, PPPoE, L2TP, and so on. In most cases, an additional layer is used such as VLAN Demux and IP Demux. Some physical interface (IFD) configuration changes can have an impact on existing subscribers as well, and are not VC-specific. For example, changing the MTU or the Hierarchical Scheduler (HS) of an existing IFD causes that IFD to be reprogrammed. This results in a brief loss of traffic or in subscribers moving to a terminating state.

  • For Virtual Chassis, client-facing interfaces are typically Aggregate Ethernet (AE) bundles. Making changes to multiple AE interfaces at the same time is not recommended on scaled systems. These changes should work fine, but the risk of unforeseen issues does increase. Consider the following examples that can be problematic with scaled subscribers on an MX-VC with AE-terminated subscribers:

    • Link Flap: AE1 has two legs that consist of ge-0/0/1 (member0) and ge-12/0/1 (member1). All subscriber-related information is mirrored on each FPC, which includes the subscriber Variable Based Flow (VBF), VLAN, CoS, Firewall Filters, and so on. These AE interfaces can scale to thousands of VLAN/Subscriber sessions, so a physical change that causes a flap creates a large amount of churn and work for the system. Although all the changes are updated and processed, there could be traffic loss and potential subscriber loss if the duration of the impact exceeds keep-alive timers.

    • Class of Service (CoS) Change: Another example that can be problematic is when adding or removing the CoS configuration for the Hierarchical Scheduler or changing the Maximum Transmission Unit (MTU) of an AE bundle. Both actions result in a reprogramming of the interfaces that are built on top of the AE. The amount of work that the system needs to do to process the changes can result in subscribers getting lost. The number of subscribers connected to the AE has a direct impact on whether or not a problem may be seen.  When subscribers start dropping off, the system will get even busier as it starts to clean up interface state, routes, and database entries for each subscriber. If multiple changes to an AE are made in a single commit, the likelihood of problems increases.

    • Recommendation: If more than a single AE needs to be modified where the actions will result in an AE link change, it is recommended that the changes be made during a maintenance window. In addition, the changes should be made to one AE at a time, leaving sufficient time for subscriber sessions to be removed and rebuilt before moving to the next AE that needs to be adjusted. The same rules also apply to interfaces that are not built by using Aggregate Ethernet. This will reduce the chances of running into any unforeseen issues.
    • Recommendation: Adding or removing legs to an existing bundle is another potential problem because it can result in reprogramming actions and updates to forwarding that generate a large amount of work for the system. When changing an AE composition, it is recommended to wait at least a minute between changes if multiple updates are being made. This will allow time to clean up sessions on that FPC before adding the leg back into the bundle.

    • Logical Tunnel (LT) Anchors: Pseudowire (PS) interfaces that terminate subscribers from an MPLS/L2 environment need an anchor interface. The anchor point could have thousands of VLANs and associated subscribers. When making changes to the underlying LT interface that is tied to the PS, or changing the location of an LT interface, there will be an impact on the system. This can result in traffic loss and subscriber churn.

    • Recommendation: If more than a single LT anchor needs to be updated to a new LT location, it is recommended to carry out the changes during a maintenance window because an impact is expected. Performing changes one at a time and confirming that subscribers recover before moving on to the next one is ideal. This will reduce the chances of running into unforeseen issues.
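
The "one at a time, confirm recovery, then move on" pattern in the recommendations above can be sketched as a small Python helper. This is a minimal sketch, not a Junos API: `apply_one_at_a_time` and its `apply_change` and `has_recovered` callbacks are hypothetical, and how a change is actually pushed or how subscriber recovery is measured is left to the operator. The 60-second default mirrors the "at least a minute" guidance for AE composition changes.

```python
import time
from typing import Callable, Iterable, List


def apply_one_at_a_time(
    targets: Iterable[str],
    apply_change: Callable[[str], None],
    has_recovered: Callable[[str], bool],
    settle_seconds: float = 60.0,
    max_checks: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Change each target (an AE or LT anchor name) in turn.

    After each change, wait settle_seconds and ask has_recovered(); repeat
    up to max_checks times. Stop at the first target that does not recover,
    so later targets are left untouched. Returns the targets that were
    changed and confirmed recovered.
    """
    done: List[str] = []
    for target in targets:
        apply_change(target)
        for _ in range(max_checks):
            sleep(settle_seconds)
            if has_recovered(target):
                break
        else:
            # Still churning after max_checks: do not touch the next target.
            return done
        done.append(target)
    return done
```

Stopping at the first target that fails to recover is a deliberate choice in this sketch: it keeps a problem confined to one AE or anchor rather than compounding churn across several.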

For best practices and performance related recommendations for MX devices running Junos OS releases prior to 15.1, see:
 
KB29590 - [Subscriber Management] Maximizing Scaling and Performance for MX Series Virtual Chassis.

 

Modification History:
2019-06-04: Added SCFD requirements and link to documentation
2021-03-25: Updated the article terminology to align with Juniper's Inclusion & Diversity initiatives.