[SRX] How to upgrade an SRX cluster with minimal down time?

Article ID: KB17947
KB Last Updated: 18 Feb 2020
Version: 18.0
Summary:

This article provides information about upgrading an SRX cluster by using a minimal downtime procedure.

 

Symptoms:

SRX chassis clusters containing mismatched code versions on the two cluster nodes can result in network instability and unpredictable cluster behavior.

This means that to properly upgrade Junos OS on chassis clusters, the following options exist:

  • ISSU (in-service software upgrade)
    • Provides a protected upgrade method in which the nodes may temporarily run different code versions during the upgrade process
  • ICU (in-band cluster upgrade for SRX-branch devices)
    • Provides a protected upgrade method in which the nodes may temporarily run different code versions during the upgrade process
  • Simultaneous reboot of both nodes after the upgrade
  • Minimal downtime procedure

The goal of this article is to provide information and the procedure for upgrading SRX chassis clusters with minimal downtime as an alternative to ISSU/ICU.
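
As a quick check for the version mismatch described above, the Junos OS version running on each node can be compared before and after the upgrade. This is a minimal sketch, assuming an operational-mode CLI session on one cluster node; the user@host prompt is a placeholder, and output is omitted because it varies by platform and release:

```
user@host> show version
```

On a chassis cluster, this command typically reports the version for each node separately, so the two entries can be compared directly.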

 

Solution:

Caveats and Behaviors

The following can be expected when using the minimal downtime procedure:

  • Depending on the network configuration, traffic will fail over between devices with minimal packet loss.

  • All existing sessions will need to be rebuilt upon redundancy-group (RG) failover.

    • New sessions are created for existing traffic once the traffic flow matches the allowed policies.

      • TCP SYN and sequence checking can be temporarily disabled to allow TCP traffic to continue passing.

      • If security is still a concern during the procedure, these features may remain enabled; however, client applications will then need to establish new TCP sessions.

    • Sessions that were using Network Address Translation (NAT) will be rebuilt with new NAT translations.

    • Sessions using an Application Layer Gateway (ALG) will require new control channel communication before the associated data sessions can be rebuilt.

  • Dynamic routing protocol adjacencies and routes will need to be re-established upon RG0 failover.
  • At no time should the chassis cluster devices communicate over the ‘control’ or ‘fabric’ links while running different Junos OS versions. Doing so can cause problems such as Routing Engine configuration loss, SPC reboots, IOC reboots, and loss of the ability to pass traffic.

    • If the two devices communicate while on different versions, a simultaneous reboot of both devices must be performed after both devices have been updated to the same Junos OS version.

  • If the fabric interfaces (fab0 or fab1) are disabled instead of being deleted or moved to new ports, the device will be inoperable after a reboot. 

    • To restore the device, enable the fabric ports and reboot.

  • During the upgrade procedure, there may be configuration discrepancies, which will prevent a successful commit.

    • At the points where this is critical, a commit check has been suggested to ensure that a simultaneous commit occurs.

  • In testing performed by Juniper Networks, a failover time of a few seconds was achieved with minimal packet loss.
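
The TCP check and commit check caveats above can be sketched as configuration-mode commands. This is an illustrative sketch, not the exact sequence from the detailed procedure documents; lines beginning with # are annotations, and whether to relax the checks at all depends on the site's security policy:

```
# Configuration mode: temporarily relax TCP checks for the upgrade window
set security flow tcp-session no-syn-check
set security flow tcp-session no-sequence-check

# Validate the candidate configuration without activating it
commit check

# After the upgrade is complete: restore the default TCP checks
delete security flow tcp-session no-syn-check
delete security flow tcp-session no-sequence-check
commit check
```

Running commit check first surfaces configuration discrepancies before an actual commit is attempted.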

Upgrade Procedure Overview

For the detailed upgrade procedure, refer to the following detailed direction documents:

NOTE: Primary = Node that is primary for RG0/RG1 at the start of the process

Secondary = Node that is secondary for RG0/RG1 at the start of the process
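
To determine which node is currently primary and which is secondary for each redundancy group, the cluster status can be inspected from operational mode. A minimal sketch; the user@host prompt is a placeholder, and output is omitted because it depends on the cluster:

```
user@host> show chassis cluster status
```

The output lists each redundancy group along with the priority and status (primary or secondary) of node0 and node1.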

  1. Disable the network interfaces on the secondary device. This isolates the unit from the network so that it does not impact traffic while the upgrade procedure is in progress.

  2. Adjust the configuration as follows:

    • Deactivate preempt for redundancy groups.

    • Deactivate interface monitoring and IP monitoring for redundancy groups.

    • Disable SYN bit checking and sequence number checking, which allows TCP traffic to rebuild sessions after failover to the secondary device without requiring a 3-way TCP handshake.

  3. Break the control and fabric link communication paths by using configuration adjustments or physical cable removal to ensure that the nodes do not communicate with one another while on different Junos OS versions.

Warning: Disabling fabric links via set interfaces fabX disable is unsupported and may cause device commit or bootup failure.

  4. Upgrade software on the secondary device first. When the upgrade is complete, reboot the secondary device.

  5. Validate that the secondary device is up and available to take over traffic. Depending on the platform, it can take several minutes for the system to complete the boot process.

  6. Switch traffic between the two devices by disabling the physical interfaces on the primary device and enabling them on the secondary device at the same time. Traffic will begin to flow on the secondary device.

  7. Ensure that the secondary device is handling traffic by checking the session table and interface statistics to verify that traffic is flowing through the device.

  8. Upgrade software on the primary device, which is no longer passing traffic. When the upgrade is complete, reboot the primary device.

  9. Validate that the primary device is up and available to take over traffic. Depending on the platform, it can take several minutes for the system to complete the boot process.

  10. Reconfigure the control / fabric link for the primary device only and then power down the primary device.

NOTE: Please refer to the detailed upgrade process documents for device-specific steps that prevent the nodes from forming a full cluster at this step.

  11. Reconfigure or connect the control / fabric link for the secondary device and then boot up the primary device.

  12. Verify that the primary device has booted successfully and that the HA cluster has formed correctly with the secondary device, including sync of sessions between the nodes.

  13. Enable the physical interfaces on the primary device. Traffic will not automatically fail back to the primary node.

  14. Activate TCP SYN checking, sequence checking, and interface monitoring, and activate preempt if required.

Note: Enabling preempt may cause data redundancy groups (RG1+) to fail over.

  15. Optional: Manually fail over to the primary device and verify that traffic is passing successfully.
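
The main CLI actions in this overview can be sketched as follows. This is an illustrative outline only, not a replacement for the detailed procedure documents. It assumes a single data redundancy group (redundancy-group 1) and that node 0 was the original primary. The image path is a placeholder, and lines beginning with # are annotations. Interface disabling and control / fabric link changes are omitted because they are platform-specific.

```
# Configuration mode, before the upgrade
# (deactivate only the statements that exist in the configuration)
deactivate chassis cluster redundancy-group 1 preempt
deactivate chassis cluster redundancy-group 1 interface-monitor
deactivate chassis cluster redundancy-group 1 ip-monitoring
commit check
commit

# Operational mode, on the isolated node being upgraded
# (/var/tmp/<junos-image>.tgz is a placeholder path)
request system software add /var/tmp/<junos-image>.tgz
request system reboot

# Operational mode, verification after reboot and after traffic switchover
show version
show chassis cluster status
show chassis cluster statistics
show security flow session summary
show interfaces terse

# Configuration mode, after both nodes run the same version and the cluster has re-formed
activate chassis cluster redundancy-group 1 interface-monitor
activate chassis cluster redundancy-group 1 ip-monitoring
activate chassis cluster redundancy-group 1 preempt
commit check
commit

# Operational mode, optional manual failback to the original primary
request chassis cluster failover redundancy-group 1 node 0
request chassis cluster failover reset redundancy-group 1
```

The failover reset command clears the manual failover flag so that the redundancy group can fail over normally afterward.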

 

Modification History:

2020-02-18: Modified article with information that is relevant and accurate

2019-06-17: Because ISSU/ICU is not supported on vSRX, this method can also be used for upgrading vSRX clusters; updated the product category accordingly.

2018-12-21: Updated LICU.PDF document to correct a device name typo in Steps 22a/22b

2018-11-19: Article checked for accuracy and clarity, and the following sentence added in the Upgrade Procedure Overview section:

  • The control and links must be physically disconnected in models other than the SRX5000 Series of devices because they cannot be disabled via configuration.

 
