[SRX] Low Throughput and Packet Loss observed for some sessions when Chassis Cluster fails over. How to troubleshoot?

Article ID: KB24232  KB Last Updated: 14 Mar 2014  Version: 6.0
Summary:

Low throughput and packet loss observed for some sessions when Chassis Cluster fails over. How to troubleshoot?

Symptoms:

  • Only some of the sessions that existed before the failover drop packets; the rest of the old sessions are OK.
  • New sessions have no problem.
  • For a problematic session, many retransmissions are seen in the packet capture taken at the source.
Cause:
 
Solution:
How to Troubleshoot?
====================

Note: The following troubleshooting is done on an SRX3400 device. The same steps apply to Branch devices, with minor changes in the troubleshooting commands. For troubleshooting purposes, destination port 3389 is chosen, which carries a Microsoft Terminal Server (RDP) flow.
  1. Find out which FPC is anchoring the session:

    SRX> show security flow session destination-port 3389
    node0:
    --------------------------------------------------------------------------

    Flow Sessions on FPC5 PIC0:
    Total sessions: 0

    Flow Sessions on FPC6 PIC0:

    Session ID: 120021158, Policy name: WIN-MGMT-1/263, State: Active, Timeout: 7200, Valid
    In: 192.168.206.36/63640 --> 192.168.178.85/3389;tcp, If: reth0.1660, Pkts: 178, Bytes: 34254
    Out: 192.168.178.85/3389 --> 192.168.206.36/63640;tcp, If: reth1.1760, Pkts: 195, Bytes: 50499
    Total sessions: 1

    node1:
    --------------------------------------------------------------------------

    Flow Sessions on FPC5 PIC0:
    Total sessions: 0

    Flow Sessions on FPC6 PIC0:

    Session ID: 120012942, Policy name: WIN-MGMT-1/263, State: Backup, Timeout: 57550, Valid
    In: 192.168.206.36/63640 --> 192.168.178.85/3389;tcp, If: reth0.1660, Pkts: 0, Bytes: 0
    Out: 192.168.178.85/3389 --> 192.168.206.36/63640;tcp, If: reth1.1760, Pkts: 0, Bytes: 0
    Total sessions: 1

    Based on the above output, the session for port 3389 is anchored on FPC6 PIC0 for both nodes.

  2. Run the following command to see more details from the CLI:

    SRX> request pfe execute target tnp tnp-name node0.fpc6.pic0 command "show usp flow session dst-port 3389"
    SENT: Ukern command: show usp flow session dst-port 3389
    GOT:
    GOT:
    GOT: Session Id: 21158, CP session Id: 29189, Policy: 263, Timeout: 7200s, state: 3, flags: 8000040/4000000/8003
    GOT: Active, failover cnt 1, sync id 0x118052a6, retry cnt 0
    GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 0, diff: 0 <<<< Notice the WSF value
    GOT: np=0xf nh: 0xb01a3c2, tunnel_info: 0x0, pkts: 478, bytes: 48332
    GOT: pmtu : 1500, tunnel pmtu: 0
    GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 8, diff: 0 <<<< Notice the WSF value
    GOT: np=0xf nh: 0xb2f93c2, tunnel_info: 0x0, pkts: 490, bytes: 74987
    GOT: pmtu : 1500, tunnel pmtu: 0
    LOCAL: End of file

    Note: For a branch device, use the following command:  

    SRX> request pfe execute target fwdd command "show usp flow session dst-port 3389"

    Notice the session is "Active" on node0.fpc6.pic0.

  3. Then run the same command on node1.fpc6.pic0:

    SRX> request pfe execute target tnp tnp-name node1.fpc6.pic0 command "show usp flow session dst-port 3389"
    SENT: Ukern command: show usp flow session dst-port 3389
    GOT:
    GOT:
    GOT: Session Id: 1183, CP session Id: 1800, Policy: 263, Timeout: 57604s, state: 3, flags: 10000040/4000000/8003
    GOT: Backup, failover cnt 0, sync id 0x118052a6, retry cnt 0
    GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 0, diff: 0 <<<< Notice the WSF value
    GOT: np=0xff nh: 0x0, tunnel_info: 0x0, pkts: 0, bytes: 0
    GOT: pmtu : 1500, tunnel pmtu: 0
    GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 8, diff: 0 <<<< Notice the WSF value
    GOT: np=0xff nh: 0x0, tunnel_info: 0x0, pkts: 0, bytes: 0
    GOT: pmtu : 1500, tunnel pmtu: 0
    LOCAL: End of file
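
If the first step has to be repeated for many flows, the target names for the "request pfe execute" commands can be derived from saved CLI output. The following Python sketch is only an illustration, not a Juniper tool: the function name anchoring_pics and the SAMPLE text are hypothetical, and the parser assumes the output layout shown in the steps above.

```python
import re

def anchoring_pics(cli_output):
    """Return names such as 'node0.fpc6.pic0' for every PIC that reports at least one session."""
    node = fpc = pic = None
    targets = []
    for raw in cli_output.splitlines():
        line = raw.strip()
        m = re.fullmatch(r"(node\d+):", line)
        if m:
            # A new node section starts; forget the previous FPC/PIC.
            node, fpc, pic = m.group(1), None, None
            continue
        m = re.fullmatch(r"Flow Sessions on FPC(\d+) PIC(\d+):", line)
        if m:
            fpc, pic = m.groups()
            continue
        m = re.fullmatch(r"Total sessions: (\d+)", line)
        if m and int(m.group(1)) > 0 and node and fpc is not None:
            targets.append(f"{node}.fpc{fpc}.pic{pic}")
    return targets

# Abbreviated copy of the "show security flow session" output from step 1.
SAMPLE = """\
node0:
--------------------------------------------------------------------------

Flow Sessions on FPC5 PIC0:
Total sessions: 0

Flow Sessions on FPC6 PIC0:

Session ID: 120021158, Policy name: WIN-MGMT-1/263, State: Active, Timeout: 7200, Valid
Total sessions: 1

node1:
--------------------------------------------------------------------------

Flow Sessions on FPC5 PIC0:
Total sessions: 0

Flow Sessions on FPC6 PIC0:

Session ID: 120012942, Policy name: WIN-MGMT-1/263, State: Backup, Timeout: 57550, Valid
Total sessions: 1
"""

print(anchoring_pics(SAMPLE))  # ['node0.fpc6.pic0', 'node1.fpc6.pic0']
```

Each returned name can be used as the tnp-name argument in steps 2 and 3.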


Good Case:
==========

The key thing to notice in the above session outputs for both nodes is the "wsf" value. For each direction of the flow, the value must match exactly on the Active and Backup nodes.

node0.fpc6.pic0
GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 0, diff: 0
GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 8, diff: 0

node1.fpc6.pic0

GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 0, diff: 0
GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 8, diff: 0

Bad Case:
=========

In the problem situation, you would see something like this:

node0.fpc6.pic0

GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 0, diff: 0
GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 8, diff: 0

node1.fpc6.pic0

GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 2, diff: 0
GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 6, diff: 0
 

Notice that the wsf values are 0 and 8 on the Active node but 2 and 6 on the Backup node. This mismatch is the problem.
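
The comparison can also be done on saved command output instead of by eye. The Python sketch below is a minimal illustration under stated assumptions: the function names wsf_by_direction and wsf_mismatches are hypothetical, and the regular expression assumes the "(in)" and "(out)" line layout shown in the outputs above.

```python
import re

# Matches the "(in)* :" and "(out) :" lines and captures the direction and wsf value.
FLOW_LINE = re.compile(r"\((in|out)\)\*?\s*:.*?\bwsf:\s*(\d+)")

def wsf_by_direction(pfe_output):
    """Map 'in' and 'out' to the wsf value printed on that direction's line."""
    found = {}
    for line in pfe_output.splitlines():
        m = FLOW_LINE.search(line)
        if m:
            found[m.group(1)] = int(m.group(2))
    return found

def wsf_mismatches(active_output, backup_output):
    """Return {direction: (active_wsf, backup_wsf)} for every direction that differs."""
    active = wsf_by_direction(active_output)
    backup = wsf_by_direction(backup_output)
    return {d: (active[d], backup.get(d))
            for d in active if active[d] != backup.get(d)}

# Lines copied from the Bad Case above.
NODE0 = """\
GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 0, diff: 0
GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 8, diff: 0
"""
NODE1_BAD = """\
GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 2, diff: 0
GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 6, diff: 0
"""

print(wsf_mismatches(NODE0, NODE1_BAD))  # {'in': (0, 2), 'out': (8, 6)}
print(wsf_mismatches(NODE0, NODE0))      # {}
```

An empty result corresponds to the Good Case; any entry in the result corresponds to the Bad Case.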


Conclusion:
===========

What does this all mean?

It means that the server is using the TCP window scale option (shown here as WSF, the window scale factor), and the SRX must track the same values while forwarding the flow. As long as the flow goes through the Primary (Active) node, the SRX holds the values that match the endpoints: 0 and 8 in this example.

However, these values failed to sync to the Backup node. When the Active node went down and the Backup node took over, its WSF values did not match those used by the server. This results in several retransmits and "TCP sequence number out of window" errors, which eventually affect the throughput.
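
To see why a wrong scale factor leads to "out of window" drops, recall that with TCP window scaling the 16-bit window field in the header is shifted left by the scale factor to obtain the real window in bytes. The Python sketch below is a simplified model and not the SRX implementation: the function names and the WINDOW_FIELD and SEQ_OFFSET values are hypothetical, and only the server-side values 8 (Active) and 6 (Backup) are taken from the example above.

```python
def scaled_window(window_field, wsf):
    """Effective window in bytes: the raw header field shifted left by the scale factor."""
    return window_field << wsf

def in_window(seq_offset, window_field, wsf):
    """True if a segment starting seq_offset bytes past the last ACK fits in the scaled window."""
    return 0 <= seq_offset < scaled_window(window_field, wsf)

WINDOW_FIELD = 1000    # hypothetical raw window value advertised by the server
SEQ_OFFSET = 100_000   # hypothetical start of a client segment, in bytes past the last ACK

# With the correct factor (8) the segment is inside the window.
print(scaled_window(WINDOW_FIELD, 8), in_window(SEQ_OFFSET, WINDOW_FIELD, 8))  # 256000 True

# With the stale factor (6) the firewall computes a window four times smaller
# and would treat the same, perfectly valid segment as out of window.
print(scaled_window(WINDOW_FIELD, 6), in_window(SEQ_OFFSET, WINDOW_FIELD, 6))  # 64000 False
```

In this model the sender is behaving correctly; only the device in the middle, holding the wrong factor, rejects the segment, which matches the retransmissions seen at the source.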


Here is another way to look at it. For each interface in question (reth0.1660 and reth1.1760 in the above example), run the following command:

SRX> show interfaces reth0.1660 extensive detail | match "TCP sequence number out of window"
TCP sequence number out of window: 37789

Run the same command a couple of times to see the increment:

TCP sequence number out of window: 37832
TCP sequence number out of window: 37863
TCP sequence number out of window: 37891

Similar results will be seen on the outgoing interface, reth1.1760:

TCP sequence number out of window: 392865
TCP sequence number out of window: 393266
TCP sequence number out of window: 393701

On a good system, no errors should be seen:

TCP sequence number out of window: 0
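
When the command is run repeatedly, the growth between samples is easier to judge than the raw totals. The Python sketch below computes the increments from saved output; the function name counter_deltas is hypothetical, and the sample values are the reth0.1660 counters shown above.

```python
import re

COUNTER = re.compile(r"TCP sequence number out of window:\s*(\d+)")

def counter_deltas(samples_text):
    """Return the increase between each pair of consecutive counter samples."""
    values = [int(v) for v in COUNTER.findall(samples_text)]
    return [later - earlier for earlier, later in zip(values, values[1:])]

SAMPLES = """\
TCP sequence number out of window: 37789
TCP sequence number out of window: 37832
TCP sequence number out of window: 37863
TCP sequence number out of window: 37891
"""

print(counter_deltas(SAMPLES))  # [43, 31, 28]
```

Non-zero increments while the affected session is passing traffic indicate that packets are still being dropped by the sequence check.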


Workaround:
===========

If the flow has a mismatch in the WSF values and several "TCP sequence number out of window" errors, then it is the known bug PR695629.
As a workaround, disable the TCP sequence check on the SRX:

# set security flow tcp-session no-sequence-check

Note that this applies only to new flows between the same source and destination, not to sessions that already exist.

Important: Refer to the Requirements section at the following link to understand the circumstances for disabling the TCP packet security check:
http://www.juniper.net/techpubs/en_US/junos11.4/topics/example/session-tcp-packet-security-check-for-srx-series-disabling-cli.html


Recommendation:
===============

Upgrade the Chassis Cluster to any of the following releases:

10.4R9 or later
11.1R7 or later
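
For a quick check of an inventory against the two releases above, the comparison can be sketched in Python. This is an illustration only: the function name has_fix is hypothetical, it understands only version strings of the form major.minor followed by R and a revision number, and it deliberately returns None for any release train not listed in this article rather than guessing whether that train contains the fix.

```python
import re

# Minimum R-release per train, taken from the list above.
FIXED_IN = {(10, 4): 9, (11, 1): 7}

def has_fix(version):
    """True/False for the 10.4 and 11.1 trains; None when this article does not say."""
    m = re.match(r"(\d+)\.(\d+)R(\d+)", version)
    if not m:
        return None
    major, minor, rev = (int(g) for g in m.groups())
    minimum = FIXED_IN.get((major, minor))
    if minimum is None:
        return None
    return rev >= minimum

print(has_fix("10.4R8.5"))   # False
print(has_fix("11.1R7.4"))   # True
print(has_fix("11.4R5"))     # None
```

For trains that return None, consult the PR and the recommended release list referenced below.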

Check for the recommended release for SRX platform:

KB21476 - JTAC Recommended Junos Software Versions
