How to recover unresponsive/fluctuating Junos VM on Qfabric DG after NSSU or DG replacement

Article ID: KB37175 KB Last Updated: 25 Sep 2021 Version: 1.0
Summary:

Users may encounter unresponsive or fluctuating VMs (virtual machines) on QFX3100 devices during or after activities such as NSSU or DG replacement.

This article describes several methods, ordered from non-intrusive to intrusive, for bringing the affected VM back to a stable state.

Note: Seven VMs are expected in total across the QFabric DGs: 2 NNG VMs, 2 FM VMs, 2 FC VMs, and 1 DRE VM.

[root@dg0 ~]# lsvm
NODE ACTIVE TAG UUID
dg0 1 _DCF_default___NW-INE-0_RE1_ 995df774-4f9a-11e8-a800-8fcb1a959581 -- NNG
dg0 1 _DCF_default___RR-INE-1_RE0_ 96f5749e-4f9a-11e8-9b51-23945e21f13a -- FC
dg1 1 _DCF_default___NW-INE-0_RE0_ 982868d0-4f9a-11e8-819b-cf29433d3495 -- NNG
dg1 1 _DCF_default___RR-INE-0_RE0_ 95c2da08-4f9a-11e8-870a-43c0d1518e4b -- FC
dg0 1 _TAG_DCF_ROOT_RE0_ 4523d5f2-4f9a-11e8-a150-57e133531c4f           -- FM
dg1 1 _TAG_DCF_ROOT_RE1_ 454cf252-4f9a-11e8-99c4-c3dd72e94e35           -- FM
dg0 1 _TAG_DRE_ 4a65f9dc-4f9a-11e8-b044-fb9bf05faf47                    -- DRE
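
The expected totals in the note can be checked mechanically. The following is a minimal bash sketch, not a Juniper tool: the function name check_vm_inventory is hypothetical, and the tag-to-role mapping (NW-INE = NNG, RR-INE = FC, DCF_ROOT = FM, _TAG_DRE_ = DRE) is inferred from the annotated sample output above. Only rows with an assigned node and ACTIVE set to 1 are counted.

```shell
# Summarize 'lsvm' output read from stdin and compare the per-role counts
# with the expected totals (2 NNG, 2 FC, 2 FM, 1 DRE).
# Assumption: the tag patterns below identify the roles, as in the sample.
check_vm_inventory() {
    awk '
        $1 == "NODE" || NF < 4 { next }      # skip the header and blank lines
        $1 == "---" || $2 != 1 { next }      # skip VMs that are not up
        $3 ~ /NW-INE/    { nng++ }
        $3 ~ /RR-INE/    { fc++ }
        $3 ~ /DCF_ROOT/  { fm++ }
        $3 ~ /_TAG_DRE_/ { dre++ }
        END {
            printf "NNG=%d FC=%d FM=%d DRE=%d\n", nng, fc, fm, dre
            if (nng == 2 && fc == 2 && fm == 2 && dre == 1) print "OK"
            else print "MISMATCH"
        }'
}
```

On a DG shell this could be run as `lsvm | check_vm_inventory`. A MISMATCH result only says that the counts differ from the note; it does not identify the cause.
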
Symptoms:

During or after activities such as NSSU or DG replacement, a VM may fail to come up, or its state may fluctuate continuously when checked with the 'lsvm' shell command.

Example:

NNG VM on DG0 is down. The 'lsvm' command was used multiple times:

[root@dg0 ~]# lsvm
NODE ACTIVE TAG UUID
--- 0 _DCF_default___NW-INE-0_RE1_ 9fd92740-5206-11e8-abbb-2fc3b5195a38  <--- NNG on DG0 is not up
dg0 1 _DCF_default___RR-INE-1_RE0_ 96f5749e-4f9a-11e8-9b51-23945e21f13a
dg1 1 _DCF_default___NW-INE-0_RE0_ 982868d0-4f9a-11e8-819b-cf29433d3495
dg1 1 _DCF_default___RR-INE-0_RE0_ 95c2da08-4f9a-11e8-870a-43c0d1518e4b
dg0 1 _TAG_DCF_ROOT_RE0_ 4523d5f2-4f9a-11e8-a150-57e133531c4f
dg1 1 _TAG_DCF_ROOT_RE1_ 454cf252-4f9a-11e8-99c4-c3dd72e94e35
dg0 1 _TAG_DRE_ 4a65f9dc-4f9a-11e8-b044-fb9bf05faf47
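
Because the symptom only shows up when 'lsvm' is run several times, repeated sampling can help tell a VM that stays down from one that fluctuates. The sketch below is an illustration in bash under stated assumptions: the function names list_down_vms and sample_down_vms are hypothetical, and the default sample count and interval are arbitrary choices, not documented values.

```shell
# Print "TAG UUID" for every VM row in 'lsvm' output (stdin) that has no
# node assigned ("---") or whose ACTIVE column is not 1.
list_down_vms() {
    awk '$1 == "NODE" || NF < 4 { next }
         $1 == "---" || $2 != 1 { print $3, $4 }'
}

# Run 'lsvm' several times and print the down VMs seen in each sample.
# Arguments: number of samples (default 5), seconds between samples (default 30).
sample_down_vms() {
    local samples=${1:-5} interval=${2:-30} i
    for ((i = 1; i <= samples; i++)); do
        echo "sample $i: $(lsvm | list_down_vms | tr '\n' ';')"
        if ((i < samples)); then
            sleep "$interval"
        fi
    done
}
```

A VM that appears in every sample is down; one that appears in some samples but not others, or appears with a changing UUID, is fluctuating.
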


Logs collected to verify the symptom:

  • /var/log directory of the DG and of the affected VM (if accessible)
  • /vmm/log directory from the affected DG

In ccif_server.log in the /vmm/log directory, the following messages are seen:

VM:  982868d0-4f9a-11e8-819b-cf29433d3495, STATE: unresponsive, NODE-STATE: Activating
 Enter Destroy VM: 982868d0-4f9a-11e8-819b-cf29433d3495
 Exit Destroy VM: 982868d0-4f9a-11e8-819b-cf29433d3495
 VM 982868d0-4f9a-11e8-819b-cf29433d3495 failed. Lifetime fails: 2, fails: 2, max_fails: 5, threshold: 3600, window: 39
 VM 982868d0-4f9a-11e8-819b-cf29433d3495 is being reactivated

This indicates that a new VM was created, but the VM Monitor thread that checks the state of the VM is not getting a response from it.
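
To see how often a VM has failed, the counters in the "failed" messages can be pulled out of the log. This is a minimal bash sketch that assumes the message layout matches the excerpt above (any prefix before "VM", such as a timestamp, is ignored); the function name summarize_vm_failures is hypothetical.

```shell
# Read ccif_server.log lines on stdin and print one summary line per
# "VM <uuid> failed." message: the UUID followed by its counters.
# Assumption: the message layout matches the excerpt shown above.
summarize_vm_failures() {
    sed -n 's/.*VM \([0-9a-f-]\{36\}\) failed\. Lifetime fails: \([0-9]*\), fails: \([0-9]*\), max_fails: \([0-9]*\).*/\1 fails=\3 max_fails=\4 lifetime=\2/p'
}
```

For example, `summarize_vm_failures < /vmm/log/ccif_server.log | tail -n 5` would show the most recent failure records. Lines that do not match, including the "STATE: unresponsive" line, are skipped.
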

Cross-checking the ccif_server.log messages against vmm_daemon.log under /vmm/log shows that the CCIF (Compute Cluster Infrastructure) service creates a new VM instance with a different serial and monitor port:

qemu_cmd: /vmm/bin/qemu.kvm -S -hda /vmm/data/live_disks/982868d0-4f9a-11e8-819b-cf29433d3495-clone-disk.img   -serial telnet:dg0:15042,nowait,server -monitor tcp:dg0:20042,server,nowait,nodelay -s -p telnet:dg0:25042,nowait,server -m 2048m -smp 1 -vnc dg0:42   -net nic,model=e1000,vlan=0,macaddr=f4:b5:2f:b8:5f:f9 -net tap,vlan=0,script=/vmm/bin/netscripts/net1-ifup,downscript=no  -net nic,model=e1000,vlan=1,macaddr=f4:b5:2f:b8:5f:f8 -net tap,vlan=1,script=/vmm/bin/netscripts/net1-ifup,downscript=no -L /vmm/bin/bios --uuid 982868d0-4f9a-11e8-819b-cf29433d3495 -drive 
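
The port changes between instances are easier to spot when only the relevant fields are extracted. The sketch below, in bash, assumes that each qemu_cmd entry sits on a single log line and keeps the argument order shown above (-serial, then -monitor, then --uuid); the function name qemu_ports is hypothetical.

```shell
# Read vmm_daemon.log lines on stdin and, for each qemu_cmd entry, print
# the VM UUID with its serial and monitor ports.
# Assumption: one qemu_cmd per line, arguments in the order shown above.
qemu_ports() {
    sed -n 's/.*qemu_cmd:.* -serial telnet:[^:]*:\([0-9]*\),.* -monitor tcp:[^:]*:\([0-9]*\),.* --uuid \([0-9a-f-]*\).*/\3 serial=\1 monitor=\2/p'
}
```

Running `qemu_ports < /vmm/log/vmm_daemon.log` and comparing successive lines for the same UUID shows whether the VM keeps being re-created on new ports.
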

This re-creation cycle continues for approximately 20-30 minutes until a response from the VM is received. If the VM still does not respond, use the methods below to recover it.
Perform the recovery steps sequentially, from non-intrusive to intrusive, until the VM is spawned and stable.

Solution:

Recovery-Plan 1: [non-intrusive]

  1. Deactivate VM:

    [root@dg0]# ccif_vm_deactivate -u 982868d0-4f9a-11e8-819b-cf29433d3495

  2. Check that the VM is deactivated using the 'lsvm' command:

    --- 1 _DCF_default___NW-INE-0_RE1_ 982868d0-4f9a-11e8-819b-cf29433d3495 <<<
  3. Activate the VM:

    [root@dg0 ~]# ccif_vm_activate -u 982868d0-4f9a-11e8-819b-cf29433d3495
    Activate: 982868d0-4f9a-11e8-819b-cf29433d3495
    Server returned: success (0)
  4. Wait for some time, then verify that the VM has spawned using the 'lsvm' command.
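
The deactivate and activate commands above can be wrapped in one function so that the UUID is typed only once. This is a hedged bash sketch rather than a supported procedure: the function name reactivate_vm, the RUN switch, and the default pause are assumptions made for illustration. It prints the commands unless RUN=1 is set, and it does not replace the 'lsvm' checks in steps 2 and 4.

```shell
# Recovery-Plan 1 as one function: deactivate, pause, activate.
# Dry-run by default; the commands are executed only when RUN=1 is set.
# The default pause (60 seconds) is an illustrative value, not a documented one.
reactivate_vm() {
    local uuid=$1 pause=${2:-60} step
    local uuid_re='^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$'
    local num_re='^[0-9]+$'
    # Refuse anything that does not look like a UUID copied from 'lsvm'.
    if [[ ! $uuid =~ $uuid_re || ! $pause =~ $num_re ]]; then
        echo "usage: reactivate_vm <vm-uuid> [pause-seconds]" >&2
        return 2
    fi
    for step in "ccif_vm_deactivate -u $uuid" "sleep $pause" "ccif_vm_activate -u $uuid"; do
        if [[ ${RUN:-0} == 1 ]]; then
            $step || return 1            # stop at the first failing command
        else
            echo "would run: $step"
        fi
    done
}
```

Calling `reactivate_vm 982868d0-4f9a-11e8-819b-cf29433d3495` prints the three steps for review; prefixing the call with `RUN=1` executes them.
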

Recovery-Plan 2: [non-intrusive]

Proceed if the above method does not correct the VM issue. These steps deactivate the VM and reactivate it together with a CCIF restart.

  1. Deactivate the VM (follow Recovery-Plan 1, steps 1 and 2).

  2. Activate the VM (follow Recovery-Plan 1, step 3) and restart CCIF:

    [root@dg0 ~]# service ccif restart
    Stop CCIF server (pid 8386): [ OK ]
    CCIF server running (pid 8386)
  3. Wait 3-4 minutes for the CCIF server restart to complete on DG-0, then check its status:

    [root@dg0 ~]# service ccif status
    CCIF server running (pid 5291)
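
One way to confirm that the restart really produced a new server process is to compare the PID reported before and after, as in the sample output (8386 before, 5291 after). The bash sketch below assumes the status text has the form shown above; the function names ccif_pid and ccif_restarted are hypothetical.

```shell
# Extract the PID from a "CCIF server running (pid N)" status line on stdin.
ccif_pid() {
    sed -n 's/.*CCIF server running (pid \([0-9]*\)).*/\1/p'
}

# Succeed when the second status text reports a running server whose PID
# differs from the one in the first status text.
ccif_restarted() {
    local before after
    before=$(printf '%s\n' "$1" | ccif_pid)
    after=$(printf '%s\n' "$2" | ccif_pid)
    [[ -n $after && $before != "$after" ]]
}
```

For example, capture `service ccif status` into a variable before the restart and again after the wait, then pass both values to ccif_restarted.
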

Recovery-Plan 3: [non-intrusive]

This method destroys the fluctuating VM and allows it to be re-spawned automatically.

  1. Find VM UUID:

    [root@dg0 ~]# lsvm
    NODE ACTIVE TAG UUID
    dg0 1 _DCF_default___NW-INE-0_RE1_ 995df774-4f9a-11e8-a800-8fcb1a959581
    dg0 1 _DCF_default___RR-INE-1_RE0_ 96f5749e-4f9a-11e8-9b51-23945e21f13a
    dg1 1 _DCF_default___NW-INE-0_RE0_ 982868d0-4f9a-11e8-819b-cf29433d3495
    dg1 1 _DCF_default___RR-INE-0_RE0_ 95c2da08-4f9a-11e8-870a-43c0d1518e4b
    dg0 1 _TAG_DCF_ROOT_RE0_ 4523d5f2-4f9a-11e8-a150-57e133531c4f
    dg1 1 _TAG_DCF_ROOT_RE1_ 454cf252-4f9a-11e8-99c4-c3dd72e94e35
    dg0 1 _TAG_DRE_ 4a65f9dc-4f9a-11e8-b044-fb9bf05faf47
  2. Destroy the VM:

    [root@dg0 ~]# ccif_client -f destroy -u 995df774-4f9a-11e8-a800-8fcb1a959581
  3. Confirm that one NNG VM is now missing:

    [root@dg0 ~]# lsvm
    NODE ACTIVE TAG UUID
    dg0 1 _DCF_default___RR-INE-1_RE0_ 96f5749e-4f9a-11e8-9b51-23945e21f13a
    dg1 1 _DCF_default___NW-INE-0_RE0_ 982868d0-4f9a-11e8-819b-cf29433d3495
    dg1 1 _DCF_default___RR-INE-0_RE0_ 95c2da08-4f9a-11e8-870a-43c0d1518e4b
    dg0 1 _TAG_DCF_ROOT_RE0_ 4523d5f2-4f9a-11e8-a150-57e133531c4f
    dg0 1 _TAG_DRE_ 4a65f9dc-4f9a-11e8-b044-fb9bf05faf47
  4. Log in to FM-0 and kill vccpd:

    qfabric-admin@FM-0> show system processes | grep vccpd
    2007 ?? S< 10:01.55 /usr/sbin/vccpd -N
    qfabric-admin@FM-0>
    qfabric-admin@FM-0> start shell
    root@FM-0:RE:0%
    root@FM-0:RE:0%
    root@FM-0:RE:0% kill -9 2007
  5. Run 'lsvm' to verify that the VM is being re-spawned. In the output below, the NNG VM is re-spawning under a new UUID:

    [root@dg0 ~]# lsvm
    NODE ACTIVE TAG UUID
    --- 0 _DCF_default___NW-INE-0_RE1_ 9fd92740-5206-11e8-abbb-2fc3b5195a38
    dg0 1 _DCF_default___RR-INE-1_RE0_ 96f5749e-4f9a-11e8-9b51-23945e21f13a
    dg1 1 _DCF_default___NW-INE-0_RE0_ 982868d0-4f9a-11e8-819b-cf29433d3495
    dg1 1 _DCF_default___RR-INE-0_RE0_ 95c2da08-4f9a-11e8-870a-43c0d1518e4b
    dg0 1 _TAG_DCF_ROOT_RE0_ 4523d5f2-4f9a-11e8-a150-57e133531c4f
    dg1 1 _TAG_DCF_ROOT_RE1_ 454cf252-4f9a-11e8-99c4-c3dd72e94e35
    dg0 1 _TAG_DRE_ 4a65f9dc-4f9a-11e8-b044-fb9bf05faf47
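
Since a re-spawned VM keeps its tag but receives a new UUID, comparing two saved 'lsvm' outputs by tag makes the change explicit. This is a minimal bash sketch, assuming the two outputs were saved to files beforehand (for example with `lsvm > before.txt`); the function name vm_uuid_changes and the file names are hypothetical.

```shell
# Compare two saved 'lsvm' outputs and report, per tag, a changed UUID,
# a tag that is new in the second file, or a tag missing from it.
# Usage: vm_uuid_changes before.txt after.txt
vm_uuid_changes() {
    awk '
        $1 == "NODE" || NF < 4 { next }      # skip the header and blank lines
        NR == FNR { old[$3] = $4; next }     # first file: remember UUID per tag
        { seen[$3] = 1 }
        ($3 in old) && old[$3] != $4 { print $3 ": " old[$3] " -> " $4 }
        !($3 in old) { print $3 ": new " $4 }
        END {
            # The order of "missing" lines is not guaranteed.
            for (t in old) if (!(t in seen)) print t ": missing"
        }
    ' "$1" "$2"
}
```

With the outputs from steps 1 and 5, the NNG tag _DCF_default___NW-INE-0_RE1_ would be reported as changing from 995df774-4f9a-11e8-a800-8fcb1a959581 to 9fd92740-5206-11e8-abbb-2fc3b5195a38.
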

Recovery-Plan 4: [intrusive]

  1. Disable the pbccif_svc0 cluster service using clusvcadm:

    [root@dg0 ~]# clusvcadm -d pbccif_svc0
  2. Wait 2 minutes, then re-enable the service:

    [root@dg0 ~]# clusvcadm -e pbccif_svc0

Recovery-Plan 5: [intrusive, last resort]

  1. Wipe CCIF. This destroys all the VMs, which are rebuilt once CCIF is restarted:

    [root@dg0 ~]# ccif_reset wipe
  2. Wait 2 minutes, then restart CCIF:

    [root@dg0 ~]# ccif_reset start