This article describes an issue in which packets may be dropped at a slow rate on QFX5K platform interfaces with no messages to indicate any errors. It also lists the releases in which this issue is fixed.
When transit traffic goes through the QFX5K device, traffic is dropped at a slow rate in the ingress direction even though the total traffic rate does not exceed the interface bandwidth.
In the output of the following command, the DROP_PKT_ING counter is seen to increment:
request pfe execute target fpc0 command "set dcbc bcm \"show c\""
DROP_PKT_ING.xe97 : 1,881,485,062 +896,163 359/s
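To watch this counter over time, it can help to parse the counter lines rather than read them by eye. The following is a minimal Python sketch, assuming each line has the same shape as the sample above (counter name, port, running total, change since the last read, and an optional per-second rate). The function names are hypothetical, and the layout of other counters in the output is not confirmed by this article.

```python
import re

# Assumed line shape, taken from the single sample in this article:
#   DROP_PKT_ING.xe97 : 1,881,485,062 +896,163 359/s
COUNTER_LINE = re.compile(
    r"^(?P<name>\w+)\.(?P<port>\w+)\s*:\s*"
    r"(?P<total>[\d,]+)\s+\+(?P<delta>[\d,]+)"
    r"(?:\s+(?P<rate>[\d,]+)/s)?\s*$"
)


def _to_int(text):
    """Convert '1,881,485,062' to 1881485062; a missing value becomes 0."""
    return int(text.replace(",", "")) if text else 0


def parse_counter_line(line):
    """Return a dict for one counter line, or None if it does not match."""
    m = COUNTER_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "port": m.group("port"),
        "total": _to_int(m.group("total")),
        "delta": _to_int(m.group("delta")),
        "rate_per_s": _to_int(m.group("rate")),
    }


def ingress_drops(output):
    """Return parsed DROP_PKT_ING entries whose delta is non-zero."""
    found = []
    for line in output.splitlines():
        entry = parse_counter_line(line)
        if entry and entry["name"] == "DROP_PKT_ING" and entry["delta"] > 0:
            found.append(entry)
    return found


sample = "DROP_PKT_ING.xe97 : 1,881,485,062 +896,163 359/s"
for e in ingress_drops(sample):
    print(e["port"], e["total"], e["delta"], e["rate_per_s"])
```

A port that keeps appearing in this list while its traffic rate is below the interface bandwidth matches the symptom described in this article.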
In the output of either of the following commands, you might observe a high value in the PORT_SP_SHARED_COUNT field that remains the same even after traffic has stopped:
request pfe execute target fpc0 command "set dcbc bcm \"dump chg port_sp_cntrs_rt_x\""
request pfe execute target fpc0 command "set dcbc bcm \"dump chg port_sp_cntrs_rt_y\""
For example:
PORT_SP_CNTRS_RT_X.mmu0[65]: <PORT_SP_SHARED_COUNT_G1=1,PORT_SP_SHARED_COUNT_G0=1,PORT_SP_SHARED_COUNT=0x1ff33,PARITY=1,ECCP=0x6d,ECC=0x2d,DATAWIDTH=1>
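The check for a value that stays the same after traffic has stopped can be scripted by comparing two dumps taken some time apart. The following is a minimal Python sketch, assuming every entry follows the layout of the sample above (table name, unit, index, then comma-separated FIELD=value pairs in angle brackets). The helper names are hypothetical.

```python
import re

# Assumed entry shape, taken from the single sample in this article:
#   PORT_SP_CNTRS_RT_X.mmu0[65]: <FIELD=value,FIELD=value,...>
ENTRY = re.compile(
    r"^(?P<table>\w+)\.(?P<unit>\w+)\[(?P<index>\d+)\]:\s*<(?P<fields>[^>]*)>\s*$"
)


def to_int(text):
    """Parse a decimal value or a 0x-prefixed hexadecimal value."""
    text = text.strip()
    return int(text, 16) if text.lower().startswith("0x") else int(text)


def parse_entry(line):
    """Return ((table, unit, index), {field: value}) or None."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    fields = {}
    for pair in m.group("fields").split(","):
        key, sep, value = pair.partition("=")
        if sep:
            fields[key.strip()] = to_int(value)
    return (m.group("table"), m.group("unit"), int(m.group("index"))), fields


def shared_counts(dump_text):
    """Map each entry key to its PORT_SP_SHARED_COUNT value."""
    counts = {}
    for line in dump_text.splitlines():
        parsed = parse_entry(line)
        if parsed and "PORT_SP_SHARED_COUNT" in parsed[1]:
            counts[parsed[0]] = parsed[1]["PORT_SP_SHARED_COUNT"]
    return counts


def stuck_entries(first_dump, second_dump):
    """Entries whose non-zero shared count is identical in both dumps.

    Both dumps are expected to be taken after traffic has stopped.
    """
    first = shared_counts(first_dump)
    second = shared_counts(second_dump)
    return {k: v for k, v in first.items() if v != 0 and second.get(k) == v}


sample = (
    "PORT_SP_CNTRS_RT_X.mmu0[65]: <PORT_SP_SHARED_COUNT_G1=1,"
    "PORT_SP_SHARED_COUNT_G0=1,PORT_SP_SHARED_COUNT=0x1ff33,"
    "PARITY=1,ECCP=0x6d,ECC=0x2d,DATAWIDTH=1>"
)
print(stuck_entries(sample, sample))
```

An entry reported here is only a candidate: the sketch flags any non-zero count that did not change, so it is meaningful only when the port is known to be idle between the two dumps.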
This issue is seen to occur only during device reboot or PFE reboot.
During system bring-up, by default, the hardware maps all the PG7 components to the SP0 components for all ports. During interface creation in the Packet Forwarding Engine (PFE) (after reboot or during channelization), the PG7 components are re-mapped to the SP1 components. If some packets are queued to SP0 before PG7 is re-mapped to SP1, the SP1 counter is decremented for those packets instead of the SP0 counter, which leaves the SP1 counter with a corrupted negative value.
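The accounting mismatch can be illustrated with a small model. The following Python sketch is not the hardware or PFE logic. It assumes that the shared counters are unsigned fields that wrap on underflow, that the field is 17 bits wide, and that the pool is looked up again when a packet leaves the buffer. None of these details is stated in this article, and the class, its methods, and the cell count of 205 are hypothetical values chosen for illustration.

```python
COUNTER_BITS = 17                 # assumption: width of the shared-count field
MASK = (1 << COUNTER_BITS) - 1


class SharedCounters:
    """Toy model of per-port shared-buffer counters for SP0 and SP1."""

    def __init__(self):
        self.sp = [0, 0]          # shared counts for SP0 and SP1
        self.pg7_pool = 0         # hardware default: PG7 maps to SP0

    def enqueue(self, cells):
        """Charge cells to whichever pool PG7 currently maps to."""
        pool = self.pg7_pool
        self.sp[pool] = (self.sp[pool] + cells) & MASK

    def remap_pg7(self, pool):
        """Interface creation re-maps PG7 (to SP1 in this article)."""
        self.pg7_pool = pool

    def dequeue(self, cells):
        """Release cells from whichever pool PG7 maps to now."""
        pool = self.pg7_pool
        self.sp[pool] = (self.sp[pool] - cells) & MASK


counters = SharedCounters()
counters.enqueue(205)             # charged to SP0 before the re-map
counters.remap_pg7(1)             # PG7 now points at SP1
counters.dequeue(205)             # released from SP1, which never held them

print("SP0 =", counters.sp[0])       # cells that are never released
print("SP1 =", hex(counters.sp[1]))  # wrapped "negative" value
```

In this model SP1 underflows from 0 and wraps to a very large unsigned number, which is how a negative count can appear as a high PORT_SP_SHARED_COUNT value that never drains.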
When the negative value is present, the system treats the shared buffer as exhausted and uses only the dedicated buffer for each port. If there is then a burst in incoming traffic, the dedicated buffer cannot absorb all of it and requests space from the shared buffer, which is considered unavailable. The packets are therefore dropped, and the DROP_PKT_ING counter increments.
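The effect on admission can be sketched in the same spirit. The following Python example is a simplified model and not the actual admission algorithm. The function name, the two-stage check, and all limits are assumptions made for illustration.

```python
def admit(cells, dedicated_used, dedicated_limit, shared_count, shared_limit):
    """Decide where an arriving packet of `cells` cells is accounted.

    Returns "dedicated", "shared", or "drop". Simplified model only.
    """
    if dedicated_used + cells <= dedicated_limit:
        return "dedicated"
    if shared_count + cells <= shared_limit:
        return "shared"
    return "drop"


# Hypothetical limits, in cells.
DEDICATED_LIMIT = 100
SHARED_LIMIT = 50_000

# Healthy port: a burst that overflows the dedicated buffer uses shared space.
print(admit(10, dedicated_used=95, dedicated_limit=DEDICATED_LIMIT,
            shared_count=0, shared_limit=SHARED_LIMIT))

# Corrupted port: the wrapped count already exceeds the limit, so the same
# burst is dropped even though no shared space is really in use.
print(admit(10, dedicated_used=95, dedicated_limit=DEDICATED_LIMIT,
            shared_count=0x1FF33, shared_limit=SHARED_LIMIT))
```

Traffic that fits within the dedicated buffer is still admitted in this model, which is consistent with drops appearing only at a slow rate during bursts.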
This issue has been fixed in Junos OS releases 14.1X53-D52, 16.2R3, 17.4R3, 18.4R3, 19.1R3, 19.2R2, 19.3R2, and 19.4R1 via PR1466770.
No workaround is known for this problem. Rebooting the device may clear the condition, but recovery is not guaranteed.