This article describes how to upgrade an SRX chassis cluster by using a minimal-downtime procedure.
SRX chassis clusters containing mismatched code versions on the two cluster nodes can result in network instability and unpredictable cluster behavior.
To upgrade Junos OS on a chassis cluster properly, the following options exist:
- ISSU (in-service software upgrade)
  - Provides a protected upgrade method in which code version differences may temporarily be seen during the upgrade process
- ICU (in-band cluster upgrade for SRX-branch devices)
  - Provides a protected upgrade method in which code version differences may temporarily be seen during the upgrade process
- Dual reboot of both nodes post upgrade
- Minimal downtime procedure
The goal of this article is to provide information and the procedure for upgrading SRX chassis clusters with minimal downtime as an alternative to ISSU/ICU.
Caveats and Behaviors
The following can be expected when using the minimal downtime procedure:
- Dynamic routing protocol adjacencies and routes will need to be re-established upon RG0 failover.
- At no time should the chassis cluster devices communicate over the ‘control’ or ‘fabric’ links while running different Junos OS versions. Doing so can cause Routing Engine configuration loss, SPC reboots, IOC reboots, and loss of the ability to pass traffic.
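Because a version mismatch across the control or fabric links is the main hazard, it can help to confirm what each node is running and how it sees the cluster before and after each stage. The following is a minimal sketch using standard Junos OS operational-mode commands, run on each node separately (for example, over the console) while the nodes are isolated. Lines beginning with # are annotations rather than commands, and the choice of which checks to run at which point is an assumption, not part of the detailed upgrade documents.

```
# Junos OS version on the local node
show version

# Redundancy group state (primary/secondary) and priorities as seen by this node
show chassis cluster status

# State of the control and fabric links as seen by this node
show chassis cluster interfaces
```

While the nodes are on different Junos OS versions, the control and fabric links are expected to show as down on both nodes.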
Upgrade Procedure Overview
For the detailed upgrade procedure, refer to the detailed upgrade process documents for your device.
NOTE: Primary = Node that is primary for RG0/RG1 at the start of the process
Secondary (also referred to as the backup device) = Node that is secondary for RG0/RG1 at the start of the process
- Disable the network interfaces on the backup device. This isolates the unit from the network so that it does not impact traffic while the upgrade is in progress.
- Adjust the configuration as follows:
  - Deactivate preempt for redundancy groups.
  - Deactivate interface monitoring and IP monitoring for redundancy groups.
  - Disable SYN bit checking and sequence number checking so that TCP traffic can rebuild sessions after failover to the secondary device without requiring a three-way TCP handshake.
- Break the control and fabric link communication paths, either through configuration changes or by physically removing the cables, so that the nodes do not communicate with one another while on different Junos OS versions.
  Warning: Disabling fabric links with set interfaces fabX disable is unsupported and may cause device commit or bootup failure.
- Upgrade the software on the backup device first. When the upgrade is complete, reboot the backup device.
- Validate that the backup device is up and available to take over traffic. Depending on the platform, the boot process can take several minutes.
- Switch traffic between the two devices by disabling the physical interfaces on the primary device and enabling them on the secondary device at the same time. Traffic then begins to flow through the secondary device.
- Confirm that the secondary device is handling traffic by checking the session table and interface statistics.
- Upgrade the software on the primary device, which is no longer passing traffic. When the upgrade is complete, reboot the primary device.
- Validate that the primary device is up and available to take over traffic. Depending on the platform, the boot process can take several minutes.
- Reconfigure the control / fabric link for the primary device only, and then power down the primary device.
  NOTE: Refer to the detailed upgrade process documents for the device-specific steps that prevent the nodes from forming a full cluster at this step.
- Reconfigure or connect the control / fabric link for the secondary device, and then boot up the primary device.
- Verify that the primary device has booted successfully and that the HA cluster with the secondary device has formed correctly, including session synchronization between the nodes.
- Enable the physical interfaces on the primary device. Traffic will not automatically fail back to the primary node.
- Activate TCP SYN checking / sequence checking and interface and IP monitoring, and preempt if required.
  Note: Enabling preempt may cause data redundancy groups (RG1+) to fail over.
- Optional: Manually fail over to the primary device and verify that traffic is passing successfully.
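The configuration adjustments made before the upgrade, the traffic checks after the switchover, and the reversal at the end can be sketched with standard Junos OS statements and commands. This is an illustration under stated assumptions, not the procedure from the detailed upgrade process documents: redundancy group 1 stands in for every data redundancy group that is configured, node 0 is assumed to be the original primary, and lines beginning with # are annotations rather than commands.

```
# Before the upgrade (configuration mode)
deactivate chassis cluster redundancy-group 1 preempt
deactivate chassis cluster redundancy-group 1 interface-monitor
deactivate chassis cluster redundancy-group 1 ip-monitoring
set security flow tcp-session no-syn-check
set security flow tcp-session no-sequence-check
commit

# After traffic is switched to the secondary device (operational mode)
show security flow session summary
monitor interface traffic

# After both nodes are upgraded and the cluster has re-formed (configuration mode)
delete security flow tcp-session no-syn-check
delete security flow tcp-session no-sequence-check
activate chassis cluster redundancy-group 1 interface-monitor
activate chassis cluster redundancy-group 1 ip-monitoring
# Only if preempt is required; this may cause RG1+ to fail over
activate chassis cluster redundancy-group 1 preempt
commit

# Optional manual failback (operational mode); node 0 is an assumed example
request chassis cluster failover redundancy-group 1 node 0
show chassis cluster status
request chassis cluster failover reset redundancy-group 1
```

Deactivate only the statements that are actually present in the configuration. If no-syn-check or no-sequence-check was already part of the original configuration, leave it in place at the end rather than deleting it.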
2020-02-18: Modified article with information that is relevant and accurate
2019-06-17: Because ISSU/ICU is not supported on vSRX, this method can also be used to upgrade vSRX clusters; the product category was updated accordingly.
2018-12-21: Updated LICU.PDF document to correct a device name typo in Steps 22a/22b
2018-11-19: Article checked for accuracy and clarity, and the following sentence added in the Upgrade Procedure Overview section: