How to upgrade an SRX cluster with minimal downtime?

  [KB17947]


Summary:
This article provides information on how to upgrade an SRX cluster with minimal downtime.
Symptoms:
A cluster must never run mismatched Junos versions; doing so can result in network instability and unpredictable behavior. To properly upgrade a cluster without ISSU (which is not supported on SRX Branch devices), both nodes must be rebooted, and they must not attempt to connect to each other while running different Junos versions.
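
As a minimal sketch, the installed Junos version and the cluster state can be checked from operational mode on each node before and after the upgrade. The `user@host>` prompt is a placeholder, and the output format differs by platform and release, so no sample output is shown.

```
user@host> show version
user@host> show chassis cluster status
```

Comparing the version reported by each node is a simple way to confirm that the two devices match before the control and fabric links are reconnected.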

Zero downtime is not currently possible on SRX clusters. This article describes how to upgrade an SRX cluster with the minimum possible downtime. The following events can be expected during the process:

  • All sessions that use network address translation (NAT) will be lost.
  • All sessions that use an ALG (such as FTP or SIP) will be lost.
  • Dynamic routing protocol adjacencies will need to be re-established upon failover between the devices.
  • All other existing sessions will be able to fail over between devices.
  • Depending on the network configuration, traffic will fail over between devices with minimal packet loss.
Cause:
 
Solution:

There are several caveats to be aware of before implementing this procedure:

  • Synchronizing the routing engines must always be performed with a reboot of one or both units. If the two units are connected on the control plane (Control Port) without one unit being rebooted, one routing engine can overwrite the other RE's configuration, which causes service outages.
  • Synchronizing the data plane must always be performed with a reboot of one or both units. If the two units are connected on the data plane (Fab Port) without one unit being rebooted, the SPUs can enter a negative state. The risk is lower than with the control plane, but it should still be avoided.
  • At no time should two devices running different Junos versions communicate across the control or fabric links. This can result in Routing Engine configuration loss, SPC reboots, IOC reboots, and loss of the ability to pass traffic. If two devices of different versions do communicate and strange behavior occurs, both devices must be rebooted simultaneously to reset them. Before the reboot, first upgrade or downgrade the software on one unit so that both devices are on the same version.
  • If the fabric interfaces (fab0 or fab1) or their associated physical interfaces are disabled, the device becomes inoperable. To restore the device, enable the fabric ports and reboot it. A reboot is the only way to make the device eligible to enter the cluster again.
  • Traffic can be failed over between devices during the upgrade with little traffic loss, using a method similar to that of many stateful firewalls. All non-network address translation (NAT) and non-application layer gateway (ALG) (FTP only) traffic will fail over to the other device. The backup unit creates new sessions for this traffic without checking whether the connection is new, which may be considered less secure. If security is a concern, the session checking features can be re-enabled after the upgrade procedure is completed. It is also possible to perform the upgrade without disabling TCP SYN and sequence checking, although this causes all sessions to end, and they must then be restarted by the client applications.
  • Dynamic routing state is not synchronized between the two cluster members. Upon failover between the devices, neighbor relationships for all protocols will need to be re-established.
  • During testing performed by Juniper Networks, failover times of a few seconds were achieved. All non-NAT and non-ALG sessions were transferred between devices.
  • During the upgrade, there may be configuration discrepancies that prevent a successful commit. At the points where this is critical, a commit check is suggested to confirm that the configuration will commit when it needs to.
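
The commit check mentioned in the last caveat can be sketched as follows. The `user@host#` prompt is a placeholder; `show | compare` lists the differences between the candidate and active configurations, and `commit check` validates the candidate configuration without activating it.

```
[edit]
user@host# show | compare
user@host# commit check
```

If the check reports errors, they can be corrected before the point in the procedure where the commit has to succeed.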

Upgrade Procedure Overview:

  1. Disable the network interfaces on the backup device. This isolates the unit from the network so that it does not affect traffic while the upgrade procedure is in progress.
  2. Disable SYN bit checking and TCP sequence number checking. This allows the secondary firewall to take over stateful, non-NAT, and non-ALG traffic without requiring a TCP three-way handshake.
  3. Disable or disconnect the control and fabric links between the two devices. This ensures that the nodes, which are running different Junos versions, do not communicate with each other.
  4. Upgrade the software on the backup firewall first. When upgrading, use the no-validate option to ignore the errors that will occur for the parts of the configuration related to the other cluster member. Once the upgrade is complete, reboot the backup device.
  5. Verify that the backup firewall is up and available to take over traffic. The boot process can take several minutes, depending on the platform.
  6. Correct the control port and fabric port configuration, if necessary, on the backup device only. This prepares the device to synchronize later in the process.
  7. The backup firewall is now ready to take over for the primary. This is one of the crucial steps in the procedure. Switch the traffic between the two devices by disabling the physical interfaces on the primary and enabling them on the secondary device at the same time. Traffic will immediately begin to flow through the secondary device.
  8. Ensure that the secondary device is handling the traffic by checking the session table to confirm that traffic is flowing through the device and that new sessions are being created.
  9. Upgrade the software on the now isolated primary firewall. When performing the upgrade, use the no-validate option to ignore the errors that will occur for the parts of the configuration related to the other cluster member. Once the upgrade is complete, reboot the primary device.
  10. Verify that the primary firewall is up and available to take over traffic. The boot process can take several minutes, depending on the platform.
  11. At this point, the primary firewall is ready to take over for the backup. This is the second crucial step in the procedure. Switch the traffic between the two devices by disabling the physical interfaces on the backup and enabling them on the primary device at the same time. Traffic will immediately begin to flow through the primary device.
  12. Ensure that the primary device is handling the traffic by checking the session table to confirm that traffic is flowing through the device and that new sessions are being created.
  13. Synchronize the cluster. First, reboot the backup device. While it is rebooting, set the correct sync ports on the primary device. When the backup device comes back up, it will synchronize with the primary device.
  14. Once the backup firewall is up and ready to process traffic, enable its physical interfaces. It will not process traffic, but it will be ready to do so if a failure occurs.
  15. If SYN checking and sequence checking were disabled before starting the activity, re-enable them if required.
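
The commands behind several of the steps above might look like the following. This is a hedged sketch, not the article's detailed procedure: the `user@host` prompts, `<interface-name>`, and `<junos-package>` are placeholders, and the interfaces and package file depend on the platform and the target release.

```
# Step 1 (configuration mode): disable a network interface on the backup device
user@host# set interfaces <interface-name> disable
user@host# commit

# Step 2 (configuration mode): disable TCP SYN and sequence number checking
user@host# set security flow tcp-session no-syn-check
user@host# set security flow tcp-session no-sequence-check
user@host# commit

# Steps 4 and 9 (operational mode, on the isolated node): install and reboot
user@host> request system software add /var/tmp/<junos-package>.tgz no-validate
user@host> request system reboot

# Steps 8 and 12 (operational mode): confirm that sessions are being created
user@host> show security flow session summary

# Step 15 (configuration mode): re-enable the checks, if required
user@host# delete security flow tcp-session no-syn-check
user@host# delete security flow tcp-session no-sequence-check
user@host# commit
```

Running the session summary command a few times in succession shows whether the session count is changing, which indicates that the device is actively handling traffic.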

Detailed upgrade procedure:

Related Links: