
[MX] Modular Port Concentrator FPC CPU spikes remain elevated


Article ID: KB37033 | Last Updated: 24 Jun 2021 | Version: 1.0
Summary:

A VPLS/bridge source MAC (SMAC) address learned over a cross-PFE aggregated Ethernet (AE) bundle might bounce between AE member PFEs for a long or indefinite time, causing an MLP-ADD storm.
The FPC CPU spikes and remains elevated for a long duration. This could impact traffic and cause an outage.

Symptoms:
  1. When the churn is ongoing, the FPC CPU is busy both in the interrupt context and in the ukernel thread mac_db.
    Example:

    user@Device> show chassis fpc
    
    Temp  CPU Utilization (%)   CPU Utilization (%) Memory    Utilization (%)
    Slot State            (C)  Total  Interrupt      1min   5min   15min DRAM (MB) Heap     Buffer
    0  Online               34     90         39       89     88     88 2048 19 25  <--- 90% CPU high Utilization
    1  Online               33     99         38       99     99     98 2048 19 25  <--- 99% CPU high Utilization
    2  Online               33     15          0       16     16     16 2048 19 13  <--- 15% CPU normal Utilization
  2. Issuing the request pfe execute command with "show sched" shows that the top thread mac_db CPU usage is high:

    user@Device> request pfe execute target fpc1 command "show sched"
    Total system uptime 0+00:00:16, (1717795 ms), 23003 thread dispatches
    
    [...]
      Top Thread:
        pid      = 15
        name     = mac_db
        time     = 3691 ms
        cpu      = 21%
    [...]
  3. MLP-ADD exceptions increase during the churn, as seen by issuing the request pfe execute command with "show jnh 0 exceptions terse":

    user@Device> request pfe execute target fpc1 command "show jnh 0 exceptions terse"
    
    Reason                             Type           Packets      Bytes
    [...]
    mlp pkt                            PUNT(11)      11442540  331833660 <--- rapidly increasing
    [...]
  4. The ddos-protection command below, with a refresh rate of 1, shows the arrival rate incrementing:

    user@Device> show ddos-protection protocols mlp add | find fpc.slot.1 | grep "(  arrival rate)|(fpc)" | refresh 1
    
          > ---(refreshed at Time)---
          >     FPC slot 1 information:
          >       Received:  257415              Arrival rate:     5577 pps
          >     FPC slot 2 information:
          >       Received:  232521              Arrival rate:     5578 pps
    user@Device> request pfe execute command " show threads" target fpc1 | no-more
      2 L  running   Idle                   320/2048   0/0/128173695 ms 13%
    16 M  asleep    mac_db                2768/8192   0/1/178276153 ms 21%  <--- mac-db high
    62 M  asleep    L2ALM Manager         4240/8192   0/1/45776753 ms  4%
    84 L  asleep    LU Background Service   968/4104   0/1/43136148 ms  4%
    97 M  asleep    Cassis Free Timer      480/4096   0/1/41576795 ms  4%
    112 L  asleep    DDOS Policers         2656/4096   0/3/107855282 ms 11%  <--- DDOS Policers rate high
    
    
    user@Device> request pfe execute command " show heap" target fpc1 | no-more
    ================ fpc1 ================
    SENT: Ukern command: show heap
     
    ID        Base      Total(b)       Free(b)       Used(b)   %   Name
    --  ----------   -----------   -----------   -----------  ---   -----------
    0    46350e80    1850404100    1596987596     253416504   13  Kernel
    1    b47ffb88      67108860      50177660      16931200   25  LAN buffer
    2    bcdfffe0      52428784      52428784             0    0  Blob
    3    b87ffb88      73400316      73400316             0    0  ISSU scratch

    The following might be seen in the message log:
    [LOG: Debug] l2alm_timer_handle_expiry_event context status 1 failed
    [LOG: Debug] l2alm_timer_handle_expiry_event context status 1 failed
    [LOG: Debug] l2alm_timer_handle_expiry_event context status 1 failed
    [LOG: Debug] l2alm_timer_handle_expiry_event context status 1 failed
    [LOG: Debug] l2alm_timer_handle_expiry_event context status 1 failed
    [LOG: Debug] l2alm_timer_handle_expiry_event context status 1 failed
    jddosd[20700]: %DAEMON-4-DDOS_PROTOCOL_VIOLATION_SET: Warning: Host-bound traffic for protocol/exception MLP:add exceeded its allowed bandwidth at fpc 0 for 52 times, started at "date and time"
    jddosd[20700]: %DAEMON-4-DDOS_PROTOCOL_VIOLATION_CLEAR: INFO: Host-bound traffic for protocol/exception MLP:add has returned to normal. Its allowed bandwidth was exceeded at fpc 0 for 51 times, from "date and time" to "date and time"
    jddosd[20700]: %DAEMON-4-DDOS_PROTOCOL_VIOLATION_CLEAR: INFO: Host-bound traffic for protocol/exception MLP:add has returned to normal. Its allowed bandwidth was exceeded at fpc 1 for 45 times, from "date and time" to "date and time"
    jddosd[20700]: %DAEMON-4-DDOS_PROTOCOL_VIOLATION_SET: Warning: Host-bound traffic for protocol/exception MLP:add exceeded its allowed bandwidth at fpc 1 for 46 times, started at "date and time"
Cause:

On a Modular Port Concentrator (MPC), if an aggregated Ethernet (AE) bundle has more than one child link hosted on different Packet Forwarding Engines (PFEs), and the upstream device load-balances the stream (based on L3 or L4 fields) across multiple AE links, then, due to a software defect, the source media access control (MAC) address learned from the cross-PFE AE might keep bouncing between AE member PFEs for a long or indefinite time and may cause an MLP-ADD storm.

The churn might cause increased CPU utilization on the MPC in the interrupt context. How long the churn lasts varies: it depends on the rate of the incoming flow from the affected source MAC and on how busy the ukernel is with other routines.

This issue may occur if the following conditions are met:

  • The AE bundle has more than one child link hosted on different PFEs.
  • The upstream router load-balances the stream (based on L3 or L4 fields) across multiple AE links hosted on different PFEs.
  • High input traffic (usually ~50K PPS or more) to the AE.
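As a hedged illustration of the first condition (the interface names and bundle are hypothetical), an AE bundle exposed to this defect could have member links that reside on different PFEs of the same MPC, for example:

```
interfaces {
    xe-0/0/0 {
        gigether-options {
            802.3ad ae0;    /* member link on one PFE */
        }
    }
    xe-0/2/0 {
        gigether-options {
            802.3ad ae0;    /* member link typically on a different PFE */
        }
    }
    ae0 {
        aggregated-ether-options {
            lacp {
                active;
            }
        }
    }
}
```

On many MPC types, each PIC slot maps to its own PFE, so xe-0/0/0 and xe-0/2/0 would typically sit on different PFEs; verify the actual port-to-PFE mapping for your specific line card before drawing conclusions.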
Solution:

Workaround: 

  • If encountered in production, bouncing (disabling and re-enabling) the AE member-link interfaces temporarily reduces CPU usage; the CPU can spike again at any time.
  • Configure all member links of the AE on the same PFE. 
    • CAUTION: This may not be desired because it reduces redundancy.
  • Change the load balancing method on the router so that the same source MAC will not be received on multiple PFEs of the AE.
    • Remove payload from the hashing configuration.
    • CAUTION: This may not be desired because there is potential for traffic imbalance. 
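The first workaround (bouncing the AE member links) can be applied from configuration mode. A minimal sketch, assuming a hypothetical member link xe-0/0/0:

```
[edit]
user@Device# set interfaces xe-0/0/0 disable
user@Device# commit
user@Device# delete interfaces xe-0/0/0 disable
user@Device# commit
```

For the load-balancing workaround, the exact hashing knob depends on platform and release (for example, payload hashing for VPLS traffic is configured under the [edit forwarding-options hash-key family multiservice payload] hierarchy on some MX platforms); consult the documentation for your release before removing payload from the hashing configuration.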


To permanently resolve this issue, upgrade Junos to one of the following fixed releases:

  • 17.1R2-S8
  • 17.1R3
  • 17.2R2
  • 17.2R1-S3
  • 17.3R1
  • 17.4R1
  • 17.2X75-D50