[SRX] Troubleshooting major RAID failure alarm in Standalone and Cluster Chassis environment

Article ID: KB35685 | KB Last Updated: 09 Jun 2021 | Version: 6.0
Summary:

A redundant array of independent disks (RAID) is an arrangement of multiple disks that provides characteristics such as fault tolerance and high performance. A RAID array is used in servers for data storage and to replicate data among multiple hard disk drives.

A RAID failure alarm is raised when the hard disk drive (HDD) in the device has been compromised. The alarm can have several causes; two well-known ones are inconsistency and loss of synchronization between the RAID disks.

This article explains how to troubleshoot the alarm when it is caused by either of these two factors and how to return the HDD to a stable state.

Symptoms:
  • One of the following two alarms may appear on the SRX device when the RAID volume is inconsistent or out of sync:
jtac@root> show chassis alarms
node0:
--------------------------------------------------------------------------
1 alarms currently active
Alarm time               Class  Description
2020-04-10 18:16:26 UTC  Major  Raid Failure, Status = inconsistent
 
jtac@root> show chassis alarms
node0:
--------------------------------------------------------------------------
1 alarms currently active
Alarm time               Class  Description
2020-04-10 18:16:26 UTC  Major  Raid Failure, Status = nosync
 
  • The following logs are also reported:
<chassisd>
Apr 10 18:16:01 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:02 /etc/mount-crash: Mounting $:linux_host_addr /var/tmp on /var/host-mnt//var/tmp: mount_nfs failed/timed out
Apr 10 18:16:03 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:12 /etc/mount-crash: Mounting $:linux_host_addr /var/tmp on /var/host-mnt//var/tmp: mount_nfs failed/timed out
Apr 10 18:16:13 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:22 /etc/mount-crash: Mounting $:linux_host_addr /var/tmp on /var/host-mnt//var/tmp: mount_nfs failed/timed out
Apr 10 18:16:23 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:25 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:26 /etc/mount-crash: Mounting $:linux_host_addr /var/tmp on /var/host-mnt//var/tmp: mount_nfs failed/timed out
 
  • RAID status will show as inconsistent:
jtac@root> show chassis raid members
node0:
--------------------------------------------------------------------------
/dev/sda: isw, "isw_bhbefbgdg", GROUP, ok, 468862126 sectors, data@ 0
node1:
--------------------------------------------------------------------------
/dev/sdb: isw, "isw_ciiibfcje", GROUP, ok, 468862126 sectors, data@ 0
/dev/sda: isw, "isw_ciiibfcje", GROUP, ok, 468862126 sectors, data@ 0

{primary:node0}
jtac@root> show chassis raid status
node0:
--------------------------------------------------------------------------
Raid Status: inconsistent
node1:
--------------------------------------------------------------------------
Raid Status: ok
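
When many devices need to be checked, the per-node status shown above can be extracted from captured CLI text. The following is a minimal sketch, not a Juniper tool: it assumes the output of 'show chassis raid status' has already been saved as a string, and the function names and the "standalone" label (used when no "nodeN:" header is present) are hypothetical.

```python
import re

def parse_raid_status(text):
    """Map each node name to the RAID status reported for it."""
    statuses = {}
    node = "standalone"  # assumed label when the output has no "nodeN:" header
    for line in text.splitlines():
        line = line.strip()
        header = re.fullmatch(r"(node\d+):", line)
        if header:
            node = header.group(1)
            continue
        status = re.fullmatch(r"Raid Status:\s*(\S+)", line)
        if status:
            statuses[node] = status.group(1)
    return statuses

def unhealthy_nodes(statuses):
    """Return the nodes whose RAID status is anything other than 'ok'."""
    return sorted(n for n, s in statuses.items() if s != "ok")

# Sample text copied from the cluster output in this article.
sample = """\
node0:
--------------------------------------------------------------------------
Raid Status: inconsistent
node1:
--------------------------------------------------------------------------
Raid Status: ok
"""

statuses = parse_raid_status(sample)
print(statuses)                   # {'node0': 'inconsistent', 'node1': 'ok'}
print(unhealthy_nodes(statuses))  # ['node0']
```

Any status other than "ok" (for example "inconsistent" or "nosync") is treated as unhealthy here, which matches the two alarm states described in this article.
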
Cause:

Possible causes are:

  • Hard Disk Failure 
  • Transient issue with RAID volume
Solution:
1. Reboot the device.

2. Re-partition the SRX device. (Note: Skip this step on SRX4100 and SRX4200 devices.)
    
    Re-install the device with the currently running version. Run the upgrade command with the "partition" option to format and re-partition the media before installation:
    > request system software add <package-name> no-validate partition  
3. Rebuild the RAID using either of the following two methods.

==========================================================================
Method #1 (recommended): Log in to the Linux host and rebuild RAID with the dmraid command.
==========================================================================

        a. Log in to Linux.

            > start shell user root
            # ssh -JU __juniper_private4__ 192.168.1.1

      In some cases, the above command may return the following error:

            root@SRX4100% ssh -JU __juniper_private4__ 192.168.1.1
            user@192.168.1.1's password:
            Permission denied, please try again.

      In this scenario, use rsh instead of ssh:

           root@SRX4100% rsh -JU __juniper_private4__ 192.168.1.1
        b. Check the RAID status and note the superset name.
           root:~# dmraid -s

            *** Group superset isw_bgeieiffch          
            --> Active Subset
            name   : isw_bgeieiffch_Volume0
            size   : 445432064
            stride : 128
            type   : mirror
            status : ok
            subsets: 0
            devs   : 2
            spares : 0
        c. Rebuild RAID.
            root:~# dmraid -R isw_bgeieiffch   <-- Cannot be used to rebuild when RAID status is "ok"
        Note: This step can take up to 15 minutes. Until it completes, 'dmraid -s' will not show the status as ok. Use 'dmsetup status' to check whether the sync is still in progress:
            root:~# dmsetup status

            isw_eceaedjjih_Raid1p1: 0 1984376 linear
            vg0_vjunos-lv_junos_recovery: 0 20709376 linear
            isw_eceaedjjih_Raid1: 0 445435400 mirror 2 8:16 8:0 789/3399 1 AA 1 core  <-- When this completes, it should be 3399/3399
            isw_eceaedjjih_Raid1p4: 0 147982424 linear
            vg0_vjunos-lv_junos: 0 41426944 linear
            isw_eceaedjjih_Raid1p3: 0 253679688 linear
            isw_eceaedjjih_Raid1p2: 0 21138672 linear
        d. Verify RAID status.
            root:~# dmraid -s
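
The rebuild progress in step c can be read from the mirror line of 'dmsetup status', where the "789/3399" field gives synced regions out of total regions. The following is a minimal sketch that assumes the command output has been captured as text; the function name is hypothetical, and the parsing relies only on the line layout shown in this article.

```python
import re

def mirror_sync_progress(dmsetup_output):
    """Return {device_name: (synced_regions, total_regions)} for mirror targets."""
    progress = {}
    for line in dmsetup_output.splitlines():
        # Split on the first colon only; later fields such as 8:16 also contain colons.
        name, sep, rest = line.strip().partition(":")
        fields = rest.split()
        if not sep or len(fields) < 3 or fields[2] != "mirror":
            continue
        # The first "N/M" field after the target type is the sync counter.
        for field in fields[3:]:
            match = re.fullmatch(r"(\d+)/(\d+)", field)
            if match:
                progress[name] = (int(match.group(1)), int(match.group(2)))
                break
    return progress

# Sample lines copied from the output in step c (annotation removed).
sample = """\
isw_eceaedjjih_Raid1p1: 0 1984376 linear
isw_eceaedjjih_Raid1: 0 445435400 mirror 2 8:16 8:0 789/3399 1 AA 1 core
isw_eceaedjjih_Raid1p4: 0 147982424 linear
"""

for name, (synced, total) in mirror_sync_progress(sample).items():
    state = "complete" if synced == total else "in progress"
    print(f"{name}: {synced}/{total} regions ({100 * synced / total:.1f}%), sync {state}")
```

The sync is complete when both numbers are equal (3399/3399 in the example from step c); only then is 'dmraid -s' expected to report the status as ok.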

==========================================================================
Method #2: Rebuild RAID from BIOS.
==========================================================================

        Note: This method results in complete data loss, and a USB re-install must be performed afterward.

        a. Reboot the device and, during bootup, press "Ctrl + I" repeatedly.
        b. The following menu appears. From here you can delete and re-create the RAID volume, but all data will be wiped.
           Copyright(C) 2003-15 Intel Corporation.  All Rights Reserved.
            **********************************[ MAIN MENU ]*********************************
            *         1.  Create RAID Volume             3.  Reset Disks to Non-RAID       *
            *         2.  Delete RAID Volume             4.  Mark Disks as Spare           *
            *                                            5.  Exit                          *
            ***************************[ DISK/VOLUME INFORMATION ]**************************
            * RAID Volumes:                                                                *
            * ID   Name              Level             Strip      Size Status      Bootable*
            * 0    Volume0           RAID1(Mirror)     N/A     212.4GB Normal        Yes   *
            *                                                                              *
            * Physical Devices:                                                            *
            * ID   Device Model     Serial #                     Size Type/Status(Vol ID)  *
            * 1    M500IT_MTFDDAK24 162212E665B8              223.5GB Member Disk(0)       *
            * 2    M500IT_MTFDDAK24 162212E65EF1              223.5GB Member Disk(0)       *
            *                                                                              *
            *                                                                              *
            *                                                                              *
            ********************************************************************************
            [**]-Select   [ESC]-Exit   [ENTER]-Select Menu

            Note: When the issue is present, the menu might not show any RAID volume, and a physical device might appear as 'Offline member' or 'Unknown Disk'.

        c. Select option 3 using arrow keys or numbers to navigate and reset both SSDs to Non-RAID.

        d. Next, create a new RAID Volume using option 1. Keep the new volume name as Volume0. Before moving ahead, you should see 1 RAID volume and both SSDs should show 'Member Disk(0)'.

        e. Next, create a bootable Junos USB using the method given in KB27369.

        f. Plug the USB into the SRX and power cycle the device. This should bring up the Linux installation menu:
                
            Copyright (C) 1994-2011 H. Peter Anvin et al
            Rebooting...
            Starting Linux Installation .......

            |   Juniper Linux Installer - (c) Juniper Networks 2014    |
            +----------------------------------------------------------+
            | Reboot                                                   |
            | Install Juniper Linux Platform                           |
            | Boot to host shell [debug]                               |
            |                                                          |
            |                                                          |
            |                                                          |
            |                                                          |
            |                                                          |
            |                                                          |
            |                                                          |
            |                                                          |
            |                                                          |
            +----------------------------------------------------------+

            Press [Tab] to edit options

        g. Select the 'Install Juniper Linux Platform' option using the arrow keys. This starts the Junos installation process. After a successful installation, the following is displayed:
            ---------------------------------------------
            [  OK  ] Installation for Junos
            Rebooting the system to complete the installation
            ---------------------------------------------
            Remove the USB and Press [Enter] key to Reboot...

        h. If the installation is unsuccessful and you end up at the 'root@(none):/tmp/root#' prompt, enter 'exit'. This restarts the installation process. If the above steps do not help, replace the device.

4. Once the alarm clears, verify the status again:
    jtac@root> show chassis alarms no-forwarding
    No alarms currently active

    jtac@root> request chassis routing-engine hard-disk-test
    RAID INFORMATION
    RAID device path: /dev/ad4
    Firmware Version: 11594
    RAID controller s/n: 12345678
    RAID Chip ID: 123
    RAID policy: SAFE
    Drive0 model: WDC WD123AAJS-4567A0
    Drive1 model: WDC WD345JD-18MSA1
    Drive0 s/n:      WD-WCAT30214999
    Drive1 s/n:      WD-WMAM9DTK4111
    Drive0 capacity: 74(GB)
    Drive1 capacity: 74(GB)
    RAID STATUS
    Drive0: On-line
    Drive1: On-line
    Number of partitions: 1
    Size of Partitions:
        Partition 0: 74(GB)
    RAID Status: Healthy

    {primary:node0}
    jtac@root> show chassis raid status
    node0:
    -------------------------------------
    Raid Status: ok

    node1:
    -------------------------------------
    Raid Status: ok

    Note: RAID commands return meaningful output only when RAID has already been configured.
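
The alarm check in step 4 can also be scripted against captured 'show chassis alarms' text. The following is a minimal sketch under that assumption; the function name is hypothetical, and the pattern is based only on the alarm line format shown in the Symptoms section.

```python
import re

# Matches lines such as:
# 2020-04-10 18:16:26 UTC  Major  Raid Failure, Status = inconsistent
RAID_ALARM = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \S+)\s+(\S+)\s+"
    r"Raid Failure, Status = (\S+)"
)

def raid_alarms(alarm_output):
    """Return (time, class, status) for every RAID failure alarm line."""
    found = []
    for line in alarm_output.splitlines():
        match = RAID_ALARM.fullmatch(line.strip())
        if match:
            found.append(match.groups())
    return found

# Sample text copied from this article: before and after recovery.
before = """\
node0:
--------------------------------------------------------------------------
1 alarms currently active
Alarm time               Class  Description
2020-04-10 18:16:26 UTC  Major  Raid Failure, Status = inconsistent
"""
after = "No alarms currently active\n"

print(raid_alarms(before))  # [('2020-04-10 18:16:26 UTC', 'Major', 'inconsistent')]
print(raid_alarms(after))   # []
```

An empty list means no RAID failure alarm is present in the captured text; other active alarms are ignored by this check.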
Modification History:
2020-07-20: Removed RE from cause. Added note to solution step 2 (NA for SRX4100/4200). Added more information to solution step 3 section c. Added steps to 'Rebuild RAID from BIOS' section of solution step 3
2020-12-24: Formatted the document for a better understanding. It was requested over case 2020-1125-0357.
2021-05-18: Removed SRX4600 from products list since it has no RAID
2021-05-26: Added alternate command in case ssh doesn't work in Method #1 (recommended): Log in to the Linux host and rebuild RAID with the dmraid command.
