[SRX] How to upgrade an SRX cluster with minimal down time?

  [KB17947]


Summary:

This article provides information about upgrading an SRX cluster with minimal down time.

 

Symptoms:

A cluster must never run mismatched Junos OS versions; doing so can result in network instability and unpredictable behavior. To properly upgrade a cluster without in-service software upgrade (ISSU), which is not supported on SRX Branch devices, both nodes must therefore be rebooted. Do not attempt to connect one node to the other while they are running different Junos OS versions.

Zero down time is currently not possible on SRX clusters. The goal of this article is to provide information about upgrading an SRX cluster with as little down time as possible. The following events can be expected during this process:

  • All sessions that use Network Address Translation (NAT) will be lost.

  • All sessions utilizing Application Layer Gateway (ALG) (such as FTP, SIP, and so on) will be lost.

  • Dynamic routing protocol adjacencies will need to be re-established upon failover between the devices.

  • All other existing sessions will be able to fail over between devices.

  • Depending on the network configuration, traffic will fail over between devices with minimal packet loss.
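
The expected session outcomes listed above can be summarized in a small sketch. The following Python snippet is illustrative only: the `Session` type and its `uses_nat` and `uses_alg` fields are hypothetical names introduced here, not part of Junos OS, and the rule simply restates the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Hypothetical, simplified view of a firewall session."""
    name: str
    uses_nat: bool = False
    uses_alg: bool = False


def survives_failover(session):
    """Return True if the session is expected to fail over to the other node.

    Per the list above, NAT sessions and ALG sessions are lost;
    all other existing sessions can fail over.
    """
    return not (session.uses_nat or session.uses_alg)


# Example sessions (made-up names, for illustration only).
sessions = [
    Session("plain-tcp"),
    Session("source-nat-web", uses_nat=True),
    Session("ftp-control", uses_alg=True),
]
lost = [s.name for s in sessions if not survives_failover(s)]
kept = [s.name for s in sessions if survives_failover(s)]
print("expected to be lost:", lost)
print("expected to fail over:", kept)
```

A classification like this could help estimate, before the maintenance window, how many client applications will need to reconnect.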

 

Solution:

There are several caveats to be aware of before implementing this procedure:

  • Synchronizing the routing engines must always be performed with a reboot on one or both units. If the two units are connected on the control plane (Control Port), without one unit being rebooted, it is possible that one routing engine may overwrite the other RE's configuration, which causes service outages.

  • Synchronizing the data plane must always be performed with a reboot on one or both units. If the two units are connected on the data plane (Fab Port), without one unit being rebooted, it is possible that the SPUs may enter a negative state. There is less risk with this than with the control plane, but it should also be avoided.

  • At no time should two devices running different Junos OS versions communicate across the control or fabric links. This can cause problems such as Routing Engine configuration loss, SPC reboots, IOC reboots, and loss of the ability to pass traffic. If two devices running different Junos OS versions do communicate and strange behavior occurs, first upgrade or downgrade one unit so that both devices are on the same version, and then reboot both devices simultaneously to reset them.

  • If the fabric interfaces (fab0 or fab1) or their associated physical interfaces are disabled, the device will become inoperable. To restore the device, enable the fabric ports and reboot. A reboot is the only way to make the device eligible to enter the cluster again.

  • Traffic can be failed over between devices during the upgrade with little traffic loss, using a method similar to that of many stateful firewalls. All non-network address translation (NAT) and non-application layer gateway (ALG) (FTP only) traffic will fail over to the other device. New sessions will be created on the backup unit for this traffic. This works because TCP SYN checking and sequence checking are disabled, so the device does not check whether a connection is new. However, this may be considered less secure. If security is still a concern, these session checking features can be enabled again after the upgrade procedure is completed. It is also possible to perform the upgrade without disabling TCP SYN and sequence checking, although this will cause all sessions to end, requiring them to be restarted by the client applications.

  • The dynamic routing state is not synchronized between the two cluster members. Upon failover between the devices, new neighbor relationships for all protocols will need to be re-established.

  • In testing performed by Juniper Networks, a failover time of a few seconds was achieved. All non-NAT and non-ALG sessions were transferred between devices.

  • During the upgrade, there may be configuration discrepancies that prevent a successful commit. At the points where this is critical, a commit check is suggested to ensure that a simultaneous commit occurs.
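
The central caveat, that nodes on different Junos OS versions must never communicate over the control or fabric links and that synchronization always requires a reboot, can be expressed as a simple pre-check. The sketch below is an illustration under stated assumptions: `safe_to_reconnect` is a hypothetical helper, the version strings are example values only, and in practice the versions would be read from each node (for example with `show version`) while the links are still disconnected.

```python
def safe_to_reconnect(node0_version, node1_version, reboot_planned):
    """Hypothetical guard for reconnecting the control and fabric links.

    Both conditions come from the caveats above:
    - the nodes must run the same Junos OS version, and
    - synchronization must happen with a reboot of one or both units.
    """
    same_version = node0_version.strip() == node1_version.strip()
    return same_version and reboot_planned


# Example values only; substitute the versions reported by each node.
print(safe_to_reconnect("12.1X46-D40.2", "12.1X46-D40.2", reboot_planned=True))
print(safe_to_reconnect("12.1X44-D35.5", "12.1X46-D40.2", reboot_planned=True))
print(safe_to_reconnect("12.1X46-D40.2", "12.1X46-D40.2", reboot_planned=False))
```

Only the first call returns `True`; the other two violate one of the caveats.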

 

Upgrade Procedure Overview

  1. Disable the network interfaces on the backup device. This is performed to isolate the unit from the network so that it will not impact traffic when the upgrade procedure is in progress.

  2. Disable the SYN bit checking and TCP sequence number checking. This allows the secondary firewall to take over stateful, non-NAT, and non-ALG traffic without requiring a 3-way TCP handshake.

  3. The control links must be physically disconnected in models other than the SRX5000 Series of devices because they cannot be disabled via configuration. The fabric links must be disabled or disconnected between the two devices. This ensures that nodes running different Junos OS versions do not communicate with one another.

  4. Upgrade the software on the backup firewall first. When upgrading, use the no-validate option to ignore the errors that will occur for the parts of the configuration related to the other cluster member. When the upgrade is complete, reboot the backup device.

  5. Validate that the backup firewall is up and available to take over traffic. It can take several minutes, depending on the platform of the system, to complete the boot process.

  6. Correct the control port and fabric port configuration, if necessary, only on the backup device. This will prepare the device to synchronize later in the process.

  7. The backup firewall is ready to take over for the primary. This is one of the crucial steps in the procedure. The traffic will now be switched between the two devices by disabling the physical interfaces on the primary device and enabling them on the secondary device at the same time. Traffic will immediately begin to flow on the secondary device.

  8. Ensure that the secondary device is handling traffic by looking at the session table and checking whether traffic is flowing through the device and whether new sessions are being created.

  9. Upgrade the software on the now isolated primary firewall. When performing the upgrade, use the no-validate option to ignore the errors that will occur for the parts of the configuration related to the other cluster member. When the upgrade is complete, reboot the primary device.

  10. Validate that the primary firewall is up and available to take over traffic. It can take several minutes, depending on the platform of the system, to complete the boot process.

  11. At this point, the primary firewall is ready to take over from the backup. This is the second crucial step in the procedure. The traffic will now be switched between the two devices by disabling the physical interfaces on the backup device and enabling them on the primary device at the same time. Traffic will immediately begin to flow on the primary device.

  12. Ensure that the primary device is handling traffic by looking at the session table and checking whether traffic is flowing through the device and whether new sessions are being created.

  13. Now it is time to synchronize the cluster. First, reboot the backup device. While it is rebooting, set the correct sync ports on the primary device. When the backup device comes back up, it will synchronize with the primary device.

  14. When the backup firewall is up and ready to process traffic, enable its physical interfaces. It will not process traffic but will be ready to do so in the event that a failure occurs.

  15. If SYN checking and sequence checking were disabled at the start of the procedure, enable them again if required.

For the detailed upgrade procedure, refer to LICU.pdf.
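
The overview above can also be read as a sequence of phases in which exactly one node carries traffic and the two nodes are linked only while their versions match. The following Python sketch models that sequence; it does not talk to any device, and `run_upgrade`, `check_links`, the phase names, and the release labels are hypothetical names introduced for illustration.

```python
def check_links(version, links_connected):
    """Raise if nodes on different versions are linked (the core caveat)."""
    if links_connected and version["primary"] != version["backup"]:
        raise RuntimeError("nodes with mismatched versions must not be linked")


def run_upgrade(old, new):
    """Walk the overview and return (phase, node carrying traffic) pairs."""
    version = {"primary": old, "backup": old}
    links_connected = True
    active = "primary"
    log = []

    def record(phase):
        # Verify the caveat after every phase, then note who carries traffic.
        check_links(version, links_connected)
        log.append((phase, active))

    # Steps 1-3: isolate the backup and separate the control/fabric links.
    links_connected = False
    record("backup isolated")

    # Steps 4-6: upgrade and reboot the backup while it is isolated.
    version["backup"] = new
    record("backup upgraded")

    # Steps 7-8: switch traffic to the backup.
    active = "backup"
    record("traffic on backup")

    # Steps 9-10: upgrade and reboot the now isolated primary.
    version["primary"] = new
    record("primary upgraded")

    # Steps 11-12: switch traffic back to the primary.
    active = "primary"
    record("traffic on primary")

    # Steps 13-14: reboot the backup and let it synchronize with the primary.
    links_connected = True
    record("cluster synchronized")
    return log


for phase, node in run_upgrade("old-release", "new-release"):
    print(f"{phase:22} traffic on {node}")
```

In this model, reconnecting the links at any point between the first and second software upgrade would raise an error, which mirrors the reason the links stay separated until both nodes run the same version.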

 

Modification History:

2018-11-19: Article checked for accuracy and clarity, and the following sentence added in the Upgrade Procedure Overview section:

  • The control links must be physically disconnected in models other than the SRX5000 Series of devices because they cannot be disabled via configuration.

 

Related Links: