A redundant array of independent disks (RAID) is an organization of multiple disks that provides fault tolerance and higher performance. RAID arrays are used in servers to store data and to replicate it across multiple hard disk drives.
A RAID failure alarm is raised when a Hard Disk Drive (HDD) in the array has been compromised. These alarms can be caused by various factors; two well-known ones are inconsistency and loss of synchronization between the RAID disks.
This article explains how to troubleshoot the alarm when it is caused by either of these two factors and how to return the HDD to a stable state.
- Users may encounter either of these two alarms on SRX devices:
jtac@root> show chassis alarms
node0:
--------------------------------------------------------------------------
1 alarms currently active
Alarm time Class Description
2020-04-10 18:16:26 UTC Major Raid Failure, Status = inconsistent
jtac@root> show chassis alarms
node0:
--------------------------------------------------------------------------
1 alarms currently active
Alarm time Class Description
2020-04-10 18:16:26 UTC Major Raid Failure, Status = nosync
- The following logs are also reported:
<chassisd>
Apr 10 18:16:01 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:02 /etc/mount-crash: Mounting $:linux_host_addr /var/tmp on /var/host-mnt//var/tmp: mount_nfs failed/timed out
Apr 10 18:16:03 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:12 /etc/mount-crash: Mounting $:linux_host_addr /var/tmp on /var/host-mnt//var/tmp: mount_nfs failed/timed out
Apr 10 18:16:13 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:22 /etc/mount-crash: Mounting $:linux_host_addr /var/tmp on /var/host-mnt//var/tmp: mount_nfs failed/timed out
Apr 10 18:16:23 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:25 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:26 /etc/mount-crash: Mounting $:linux_host_addr /var/tmp on /var/host-mnt//var/tmp: mount_nfs failed/timed out
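As a quick sanity check, the recurring SNMP trap can be counted from a saved copy of the log. The sketch below runs against sample lines taken from the excerpt above; the /tmp path is only for illustration (on the device itself, 'show log chassisd | match "Hard Disk Failed"' serves the same purpose).

```shell
# Count the "Hard Disk Failed" SNMP traps in a saved log excerpt.
# The sample file and its path are illustrative, not from the device.
cat > /tmp/chassisd.sample <<'EOF'
Apr 10 18:16:01 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed
Apr 10 18:16:02 /etc/mount-crash: Mounting $:linux_host_addr /var/tmp on /var/host-mnt//var/tmp: mount_nfs failed/timed out
Apr 10 18:16:03 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed
EOF
trap_count=$(grep -c 'Hard Disk Failed' /tmp/chassisd.sample)
echo "Hard Disk Failed traps seen: $trap_count"
```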
- RAID status will show as inconsistent:
jtac@root> show chassis raid members
node0:
--------------------------------------------------------------------------
/dev/sda: isw, "isw_bhbefbgdg", GROUP, ok, 468862126 sectors, data@ 0
node1:
--------------------------------------------------------------------------
/dev/sdb: isw, "isw_ciiibfcje", GROUP, ok, 468862126 sectors, data@ 0
/dev/sda: isw, "isw_ciiibfcje", GROUP, ok, 468862126 sectors, data@ 0
{primary:node0}
jtac@root> show chassis raid status
node0:
--------------------------------------------------------------------------
Raid Status: inconsistent
node1:
--------------------------------------------------------------------------
Raid Status: ok
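When checking a chassis cluster, it can help to pull out just the nodes whose RAID is not healthy. A minimal sketch, run against a condensed copy of the output above (the /tmp path and the condensed sample are assumptions for illustration):

```shell
# List every node whose "Raid Status" line is anything other than "ok".
cat > /tmp/raid_status.sample <<'EOF'
node0:
Raid Status: inconsistent
node1:
Raid Status: ok
EOF
bad_nodes=$(awk '/^node/ {sub(/:$/, "", $1); node = $1}
                 /Raid Status:/ && $3 != "ok" {print node}' /tmp/raid_status.sample)
echo "Nodes with unhealthy RAID: $bad_nodes"
```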
1. Reboot the device.
2. Re-partition the SRX device. (Note: Bypass this step if the SRX model is SRX4100/SRX4200.)
Re-install the device with the currently running version. Run the upgrade command with the "partition" option to format and re-partition the media before installation:
> request system software add <package-name> no-validate partition
3. Reconfigure RAID via either of the following two methods.
==========================================================================
Method #1 (recommended): Log in to the Linux host and rebuild RAID with the
dmraid command.
==========================================================================
a. Log in to Linux.
> start shell user root
# ssh -JU __juniper_private4__ 192.168.1.1
b. Check RAID status and note superset name.
root:~# dmraid -s
*** Group superset isw_bgeieiffch
--> Active Subset
name : isw_bgeieiffch_Volume0
size : 445432064
stride : 128
type : mirror
status : ok
subsets: 0
devs : 2
spares : 0
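The superset name reported on the '*** Group superset' line is the argument that the rebuild command in the next step needs. A minimal sketch of extracting it from a saved copy of the output (the sample text and /tmp path are illustrative):

```shell
# Pull the superset name out of saved "dmraid -s" output so it can be
# passed to "dmraid -R <superset>" without retyping it by hand.
cat > /tmp/dmraid.sample <<'EOF'
*** Group superset isw_bgeieiffch
--> Active Subset
name   : isw_bgeieiffch_Volume0
status : ok
EOF
superset=$(awk '/Group superset/ {print $NF}' /tmp/dmraid.sample)
echo "Rebuild with: dmraid -R $superset"
```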
c. Rebuild RAID.
root:~# dmraid -R isw_bgeieiffch <-- Cannot be used to rebuild when RAID status is "ok"
Note: This step can take up to 15 minutes. Until it completes, 'dmraid -s' will not show the status as ok. Use 'dmsetup status' to verify whether the sync is in progress:
root:~# dmsetup status
isw_eceaedjjih_Raid1p1: 0 1984376 linear
vg0_vjunos-lv_junos_recovery: 0 20709376 linear
isw_eceaedjjih_Raid1: 0 445435400 mirror 2 8:16 8:0 789/3399 1 AA 1 core <-- When this completes, it should be 3399/3399
isw_eceaedjjih_Raid1p4: 0 147982424 linear
vg0_vjunos-lv_junos: 0 41426944 linear
isw_eceaedjjih_Raid1p3: 0 253679688 linear
isw_eceaedjjih_Raid1p2: 0 21138672 linear
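On the mirror line, the 789/3399 field is the number of resynced regions over the total, so the progress can be turned into a percentage. A minimal sketch against the mirror line shown above:

```shell
# Convert the X/Y regions field of a "dmsetup status" mirror line
# into a completion percentage (the line below is from the output above).
line='isw_eceaedjjih_Raid1: 0 445435400 mirror 2 8:16 8:0 789/3399 1 AA 1 core'
frac=$(echo "$line" | awk '{for (i = 1; i <= NF; i++)
                              if ($i ~ /^[0-9]+\/[0-9]+$/) print $i}')
done_regions=${frac%/*}
total_regions=${frac#*/}
pct=$((100 * done_regions / total_regions))
echo "Resync progress: $done_regions/$total_regions regions ($pct%)"
```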
d. Verify RAID status.
root:~# dmraid -s
==========================================================================
Method #2: Rebuild RAID from BIOS.
==========================================================================
Note: This will result in complete data loss and USB re-install will have to be performed.
a. Reboot the device and, during bootup, keep pressing "Ctrl + I".
b. After entering the following menu, users will be able to delete and re-create the RAID volume, but data will be wiped out.
Copyright(C) 2003-15 Intel Corporation. All Rights Reserved.
**********************************[ MAIN MENU ]*********************************
* 1. Create RAID Volume 3. Reset Disks to Non-RAID *
* 2. Delete RAID Volume 4. Mark Disks as Spare *
* 5. Exit *
***************************[ DISK/VOLUME INFORMATION ]**************************
* RAID Volumes: *
* ID Name Level Strip Size Status Bootable*
* 0 Volume0 RAID1(Mirror) N/A 212.4GB Normal Yes *
* *
* Physical Devices: *
* ID Device Model Serial # Size Type/Status(Vol ID) *
* 1 M500IT_MTFDDAK24 162212E665B8 223.5GB Member Disk(0) *
* 2 M500IT_MTFDDAK24 162212E65EF1 223.5GB Member Disk(0) *
* *
* *
* *
********************************************************************************
[**]-Select [ESC]-Exit [ENTER]-Select Menu
Note: In the failure state, you might not see any RAID volume, and/or a physical device might show as 'Offline Member' or 'Unknown Disk'.
c. Select option 3 using arrow keys or numbers to navigate and reset both SSDs to Non-RAID.
d. Next, create a new RAID Volume using option 1. Keep the new volume name as Volume0. Before moving ahead, you should see 1 RAID volume and both SSDs should show 'Member Disk(0)'.
e. Next, create a bootable Junos USB using the method given in KB27369.
f. Plug the USB drive into the SRX and power cycle the device. This should take you to the Linux installation menu:
Copyright (C) 1994-2011 H. Peter Anvin et al
Rebooting...
Starting Linux Installation .......
+------------------------------------------------------------+
|       Juniper Linux Installer - (c) Juniper Networks 2014  |
+------------------------------------------------------------+
| Reboot                                                     |
| Install Juniper Linux Platform                             |
| Boot to host shell [debug]                                 |
|                                                            |
+------------------------------------------------------------+
Press [Tab] to edit options
g. Select the 'Install Juniper Linux Platform' option, using the arrow keys to navigate. This starts the Junos installation process. After a successful install, you will see the following:
---------------------------------------------
[ OK ] Installation for Junos
Rebooting the system to complete the installation
---------------------------------------------
Remove the USB and Press [Enter] key to Reboot...
h. If the install is unsuccessful and you end up at the 'root@(none):/tmp/root#' prompt, enter 'exit'. This will restart the install process. If the above steps do not help, replace the device.
4. Once the alarm clears, you can verify the status again.
jtac@root> show chassis alarms no-forwarding
No alarms currently active
jtac@root> request chassis routing-engine hard-disk-test
RAID INFORMATION
RAID device path: /dev/ad4
Firmware Version: 11594
RAID controller s/n: 12345678
RAID Chip ID: 123
RAID policy: SAFE
Drive0 model: WDC WD123AAJS-4567A0
Drive1 model: WDC WD345JD-18MSA1
Drive0 s/n: WD-WCAT30214999
Drive1 s/n: WD-WMAM9DTK4111
Drive0 capacity: 74(GB)
Drive1 capacity: 74(GB)
RAID STATUS
Drive0: On-line
Drive1: On-line
Number of partitions: 1
Size of Partitions:
Partition 0: 74(GB)
RAID Status: Healthy
{primary:node0}
jtac@root> show chassis raid status
node0:
-------------------------------------
Raid Status: ok
node1:
-------------------------------------
Raid Status: ok
Note: All RAID commands produce meaningful output only when RAID has been configured beforehand.
2020-07-20: Removed RE from cause. Added note to solution step 2 (NA for SRX4100/4200). Added more information to solution step 3 section c. Added steps to 'Rebuild RAID from BIOS' section of solution step 3
2020-12-24: Formatted the document for a better understanding. It was requested over case 2020-1125-0357.