How to Troubleshoot?
====================
Note: The following troubleshooting is done on an SRX3400 device. The same steps apply to Branch devices, with minor changes to the troubleshooting commands. For troubleshooting purposes, destination port 3389 is used, which corresponds to a Microsoft Terminal Server (RDP) flow.
- Find out which FPC is anchoring the session:
SRX> show security flow session destination-port 3389
node0:
--------------------------------------------------------------------------
Flow Sessions on FPC5 PIC0:
Total sessions: 0
Flow Sessions on FPC6 PIC0:
Session ID: 120021158, Policy name: WIN-MGMT-1/263, State: Active, Timeout: 7200, Valid
In: 192.168.206.36/63640 --> 192.168.178.85/3389;tcp, If: reth0.1660, Pkts: 178, Bytes: 34254
Out: 192.168.178.85/3389 --> 192.168.206.36/63640;tcp, If: reth1.1760, Pkts: 195, Bytes: 50499
Total sessions: 1
node1:
--------------------------------------------------------------------------
Flow Sessions on FPC5 PIC0:
Total sessions: 0
Flow Sessions on FPC6 PIC0:
Session ID: 120012942, Policy name: WIN-MGMT-1/263, State: Backup, Timeout: 57550, Valid
In: 192.168.206.36/63640 --> 192.168.178.85/3389;tcp, If: reth0.1660, Pkts: 0, Bytes: 0
Out: 192.168.178.85/3389 --> 192.168.206.36/63640;tcp, If: reth1.1760, Pkts: 0, Bytes: 0
Total sessions: 1
Based on the above output, the session for port 3389 is anchored on FPC6 PIC0 for both nodes.
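The lookup above can also be done by script when the output is long. The following is a minimal Python sketch, assuming the output of "show security flow session" has already been captured as text in the layout shown above. The function name find_anchor_pics and the SAMPLE string are illustrative and not part of Junos.

```python
import re

def find_anchor_pics(cli_output):
    """Return (node, "FPCx PICy", state) for every session found in the output."""
    results = []
    node = None
    pic = None
    for raw in cli_output.splitlines():
        line = raw.strip()
        m = re.match(r"^(node\d+):$", line)
        if m:
            node = m.group(1)          # e.g. "node0"
            continue
        m = re.match(r"^Flow Sessions on (FPC\d+ PIC\d+):$", line)
        if m:
            pic = m.group(1)           # e.g. "FPC6 PIC0"
            continue
        m = re.match(r"^Session ID: \d+,.*State: (\w+),", line)
        if m:
            results.append((node, pic, m.group(1)))
    return results

# Abbreviated copy of the output shown above (In/Out lines omitted).
SAMPLE = """\
node0:
Flow Sessions on FPC5 PIC0:
Total sessions: 0
Flow Sessions on FPC6 PIC0:
Session ID: 120021158, Policy name: WIN-MGMT-1/263, State: Active, Timeout: 7200, Valid
Total sessions: 1
node1:
Flow Sessions on FPC5 PIC0:
Total sessions: 0
Flow Sessions on FPC6 PIC0:
Session ID: 120012942, Policy name: WIN-MGMT-1/263, State: Backup, Timeout: 57550, Valid
Total sessions: 1
"""

for node, pic, state in find_anchor_pics(SAMPLE):
    print(node, pic, state)
```

The node and PIC names returned here are what go into the tnp-name argument of the next command, for example node0.fpc6.pic0.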
- Run the following command to see more details from the CLI:
SRX> request pfe execute target tnp tnp-name node0.fpc6.pic0 command "show usp flow session dst-port 3389"
SENT: Ukern command: show usp flow session dst-port 3389
GOT:
GOT:
GOT: Session Id: 21158, CP session Id: 29189, Policy: 263, Timeout: 7200s, state: 3, flags: 8000040/4000000/8003
GOT: Active, failover cnt 1, sync id 0x118052a6, retry cnt 0
GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 0, diff: 0 <<<< Notice the WSF value
GOT: np=0xf nh: 0xb01a3c2, tunnel_info: 0x0, pkts: 478, bytes: 48332
GOT: pmtu : 1500, tunnel pmtu: 0
GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 8, diff: 0 <<<< Notice the WSF value
GOT: np=0xf nh: 0xb2f93c2, tunnel_info: 0x0, pkts: 490, bytes: 74987
GOT: pmtu : 1500, tunnel pmtu: 0
LOCAL: End of file
Note: For a branch device, use the following command:
SRX> request pfe execute target fwdd command "show usp flow session dst-port 3389"
Notice the session is "Active" on node0.fpc6.pic0.
- Then run the same command on node1.fpc6.pic0:
SRX> request pfe execute target tnp tnp-name node1.fpc6.pic0 command "show usp flow session dst-port 3389"
SENT: Ukern command: show usp flow session dst-port 3389
GOT:
GOT:
GOT: Session Id: 1183, CP session Id: 1800, Policy: 263, Timeout: 57604s, state: 3, flags: 10000040/4000000/8003
GOT: Backup, failover cnt 0, sync id 0x118052a6, retry cnt 0
GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 0, diff: 0 <<<< Notice the WSF value
GOT: np=0xff nh: 0x0, tunnel_info: 0x0, pkts: 0, bytes: 0
GOT: pmtu : 1500, tunnel pmtu: 0
GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 8, diff: 0 <<<< Notice the WSF value
GOT: np=0xff nh: 0x0, tunnel_info: 0x0, pkts: 0, bytes: 0
GOT: pmtu : 1500, tunnel pmtu: 0
LOCAL: End of file
Good Case:
==========
The key thing to notice in the above session outputs for both nodes is the "wsf" value. These values must match exactly on the Active and Backup nodes.
node0.fpc6.pic0
GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 0, diff: 0
GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 8, diff: 0
node1.fpc6.pic0
GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 0, diff: 0
GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 8, diff: 0
Bad Case:
=========
In the problem situation, you would see something like this:
node0.fpc6.pic0
GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 0, diff: 0
GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 8, diff: 0
node1.fpc6.pic0
GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 2, diff: 0
GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 6, diff: 0
Notice that the wsf values are 0 and 8 on the Active node but 2 and 6 on the Backup node. This mismatch is the problem.
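The good and bad cases above can be told apart by script. The following is a minimal Python sketch, assuming the "GOT:" output from each node has been captured as text and contains a single session. The names wsf_values and wsf_mismatches are illustrative, not Junos features.

```python
import re

# Matches the "(in)* :" and "(out) :" lines and captures the wsf value.
WSF_LINE = re.compile(r"\((in|out)\)\*?\s*:.*\bwsf:\s*(\d+)")

def wsf_values(pfe_output):
    """Map direction ("in" / "out") to the wsf value in one node's output."""
    values = {}
    for line in pfe_output.splitlines():
        m = WSF_LINE.search(line)
        if m:
            values[m.group(1)] = int(m.group(2))
    return values

def wsf_mismatches(active_output, backup_output):
    """Return {direction: (active_wsf, backup_wsf)} for directions that differ."""
    active = wsf_values(active_output)
    backup = wsf_values(backup_output)
    return {
        direction: (active.get(direction), backup.get(direction))
        for direction in sorted(set(active) | set(backup))
        if active.get(direction) != backup.get(direction)
    }

NODE0 = """\
GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 0, diff: 0
GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 8, diff: 0
"""
NODE1_BAD = """\
GOT: (in)* : 192.168.206.36/63640 -> 192.168.178.85/3389;6, If: reth0.1660 (24586), flag: 0021, wsf: 2, diff: 0
GOT: (out) : 192.168.206.36/63640 <- 192.168.178.85/3389;6, If: reth1.1760 (32775), flag: 0020, wsf: 6, diff: 0
"""

print(wsf_mismatches(NODE0, NODE0))      # good case: empty dict
print(wsf_mismatches(NODE0, NODE1_BAD))  # bad case: both directions differ
```

An empty result corresponds to the good case. Any entry in the result means the Backup node holds a different wsf value than the Active node for that direction.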
Conclusion:
===========
What does this all mean?
It means that the server is setting a WSF (window scale factor) value, and the SRX needs to match it while the flow transits the device. As long as the flow goes through the Primary (Active) node, the SRX matches the server requirement and holds WSF values of 0 and 8.
However, these values failed to sync to the Backup node. When the Active node went down and the Backup took over, its WSF values did not match the server's, resulting in several retransmits and "TCP sequence number out of window" errors. This eventually affects the throughput.
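To see why a wrong WSF value produces out-of-window errors, here is a small worked example in Python. It assumes the wsf field is the TCP window scale shift count from RFC 7323, where the usable window is the 16-bit window field shifted left by that count. The raw window value of 256 and the byte offset of 20000 are hypothetical, and the check is a simplification, not a description of the SRX's internal logic.

```python
def effective_window(window_field, scale_shift):
    """Usable receive window in bytes under TCP window scaling (RFC 7323)."""
    return window_field << scale_shift

def in_window(offset_from_ack, window_bytes):
    """Simplified check: is a byte offset inside the receive window?"""
    return 0 <= offset_from_ack < window_bytes

RAW_WINDOW = 256       # hypothetical value of the TCP header window field
SEGMENT_OFFSET = 20000 # hypothetical offset of an in-flight segment

active_view = effective_window(RAW_WINDOW, 8)  # Active node, wsf 8: 65536 bytes
backup_view = effective_window(RAW_WINDOW, 6)  # Backup node, wsf 6: 16384 bytes

print(in_window(SEGMENT_OFFSET, active_view))  # True: accepted with the correct wsf
print(in_window(SEGMENT_OFFSET, backup_view))  # False: looks out of window with the stale wsf
```

With the stale value the computed window is a quarter of the real one, so segments that the endpoints consider valid can be judged out of window after failover.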
Another way to confirm the problem is to check the error counters on the interfaces in question (reth0.1660 and reth1.1760 in the above example). Run the following command:
SRX> show interfaces reth0.1660 extensive detail | match "TCP sequence number out of window"
TCP sequence number out of window: 37789
Run the same command a couple of times to see the increment:
TCP sequence number out of window: 37832
TCP sequence number out of window: 37863
TCP sequence number out of window: 37891
Similar results will be seen on the outgoing interface, i.e. reth1.1760:
TCP sequence number out of window: 392865
TCP sequence number out of window: 393266
TCP sequence number out of window: 393701
On a good system, no errors should be seen:
TCP sequence number out of window: 0
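When the command is run several times, the growth between samples matters more than the absolute count. The following is a minimal Python sketch, assuming the matched counter lines have been collected into a list in the order they were sampled. The function name counter_deltas is illustrative.

```python
import re

COUNTER = re.compile(r"TCP sequence number out of window:\s*(\d+)")

def counter_deltas(lines):
    """Return the increase between consecutive samples of the counter."""
    values = []
    for line in lines:
        m = COUNTER.search(line)
        if m:                      # ignore lines that are not counter samples
            values.append(int(m.group(1)))
    return [later - earlier for earlier, later in zip(values, values[1:])]

# Samples for reth0.1660 taken from the output above.
RETH0_SAMPLES = [
    "TCP sequence number out of window: 37789",
    "TCP sequence number out of window: 37832",
    "TCP sequence number out of window: 37863",
    "TCP sequence number out of window: 37891",
]

print(counter_deltas(RETH0_SAMPLES))  # [43, 31, 28]
```

Deltas that stay above zero across samples indicate the errors are still occurring. A counter that is nonzero but no longer growing may only reflect an earlier event.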
Workaround:
===========
If the flow has a mismatch in the WSF values and shows several "TCP sequence number out of window" errors, you are hitting the known bug PR695629.
As a workaround, disable the TCP sequence check on the SRX:
# set security flow tcp-session no-sequence-check
Note that this applies only to new flows from the same source and destination.
Important: Refer to the Requirements section at the following link to understand the circumstances under which the TCP packet security check can be disabled:
http://www.juniper.net/techpubs/en_US/junos11.4/topics/example/session-tcp-packet-security-check-for-srx-series-disabling-cli.html
Recommendation:
===============
Upgrade the Chassis Cluster to any of the following releases:
10.4R9 or later
11.1R7 or later
Also check the recommended release for your SRX platform:
KB21476 - JTAC Recommended Junos Software Versions
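The release list above can be checked against a running version by script. The following is a minimal Python sketch, assuming version strings of the form "10.4R9.2" as reported by "show version". It deliberately returns None for release trains that the list above does not mention, or for strings it cannot parse, rather than guessing. The name has_fix is illustrative.

```python
import re

# Minimum R-release per train, taken from the Recommendation list above.
FIXED_IN = {(10, 4): 9, (11, 1): 7}

def has_fix(version):
    """True/False for the listed trains, None if the list does not cover it."""
    m = re.match(r"^(\d+)\.(\d+)R(\d+)", version)
    if not m:
        return None                      # not a plain R release, cannot judge
    train = (int(m.group(1)), int(m.group(2)))
    r_number = int(m.group(3))
    if train not in FIXED_IN:
        return None                      # train not mentioned in the list
    return r_number >= FIXED_IN[train]

for version in ("10.4R8.5", "10.4R9.2", "11.1R7.4", "11.4R1.6"):
    print(version, has_fix(version))
```

A None result means the list above does not answer the question for that release, so consult KB21476 or the PR details instead.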