
[SRX] Troubleshooting major RAID failure alarm in Standalone and Cluster Chassis environment


Article ID: KB35685 KB Last Updated: 24 Jul 2020 Version: 2.0
Summary:

A redundant array of independent disks (RAID) is an arrangement of multiple disks that provides fault tolerance and high performance. A RAID array is used in servers for data storage and to replicate data among multiple hard disk drives.

A RAID failure alarm is raised on a device when the hard disk drive (HDD) has been compromised. These alarms can have various causes; two well-known ones are inconsistency and loss of synchronization in the RAID disks.

This article explains how to troubleshoot the alarm when it is caused by either of these two factors and how to return the HDD to a stable state.

Symptoms:

You may see either of the following two alarms on SRX devices when one of the causes listed in this article occurs.

jtac@root> show chassis alarms
node0:
--------------------------------------------------------------------------
1 alarms currently active
Alarm time               Class  Description
2020-04-10 18:16:26 UTC  Major  Raid Failure, Status = inconsistent
 
jtac@root> show chassis alarms
node0:
--------------------------------------------------------------------------
1 alarms currently active
Alarm time               Class  Description
2020-04-10 18:16:26 UTC  Major  Raid Failure, Status = nosync
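
For monitoring scripts, the alarm text shown above can be matched with a short parser. The following is a minimal sketch in Python, assuming the 'show chassis alarms' output has already been captured as a string. The function name raid_alarm_statuses is hypothetical and not part of any Juniper tool, and the handling of standalone devices (no "nodeN:" header) is an assumption.

```python
import re

# Matches the alarm description shown above, e.g.
# "Raid Failure, Status = inconsistent"
RAID_ALARM = re.compile(r"Raid Failure, Status = (\w+)")
# Matches cluster node headers such as "node0:"
NODE_HEADER = re.compile(r"^(node\d+):\s*$")


def raid_alarm_statuses(cli_output):
    """Return a list of (node, status) pairs for RAID failure alarms.

    node is None when no "nodeN:" header precedes the alarm
    (assumed to be the case on a standalone device).
    """
    node = None
    found = []
    for line in cli_output.splitlines():
        header = NODE_HEADER.match(line.strip())
        if header:
            node = header.group(1)
            continue
        alarm = RAID_ALARM.search(line)
        if alarm:
            found.append((node, alarm.group(1)))
    return found


sample = """node0:
--------------------------------------------------------------------------
1 alarms currently active
Alarm time               Class  Description
2020-04-10 18:16:26 UTC  Major  Raid Failure, Status = nosync
"""
print(raid_alarm_statuses(sample))  # [('node0', 'nosync')]
```

An empty list means no RAID failure alarm was found in the captured text.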
 

The following logs are also reported:

<chassisd>
Apr 10 18:16:01 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:02 /etc/mount-crash: Mounting $:linux_host_addr /var/tmp on /var/host-mnt//var/tmp: mount_nfs failed/timed out
Apr 10 18:16:03 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:12 /etc/mount-crash: Mounting $:linux_host_addr /var/tmp on /var/host-mnt//var/tmp: mount_nfs failed/timed out
Apr 10 18:16:13 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:22 /etc/mount-crash: Mounting $:linux_host_addr /var/tmp on /var/host-mnt//var/tmp: mount_nfs failed/timed out
Apr 10 18:16:23 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:25 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)
Apr 10 18:16:26 /etc/mount-crash: Mounting $:linux_host_addr /var/tmp on /var/host-mnt//var/tmp: mount_nfs failed/timed out
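
To gauge how often the trap repeats, the chassisd lines above can be counted per FRU name. This is a minimal sketch in Python, assuming the log lines have been exported to a list of strings; count_disk_failed_traps is a hypothetical helper name, and the pattern is based only on the log format shown in this article.

```python
import re
from collections import Counter

# Captures the FRU name (e.g. "node0 Routing Engine") from the
# "Hard Disk Failed" trap lines shown above.
TRAP = re.compile(
    r"CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed "
    r"\(.*?jnxFruName (.+?), jnxFruType"
)


def count_disk_failed_traps(log_lines):
    """Count 'Hard Disk Failed' traps per FRU name."""
    counts = Counter()
    for line in log_lines:
        match = TRAP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Apr 10 18:16:01 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)",
    "Apr 10 18:16:02 /etc/mount-crash: Mounting $:linux_host_addr /var/tmp on /var/host-mnt//var/tmp: mount_nfs failed/timed out",
    "Apr 10 18:16:03 chassisd[5415]: CHASSISD_SNMP_TRAP7: SNMP trap generated: Hard Disk Failed (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName node0 Routing Engine, jnxFruType 6, jnxFruSlot 0)",
]
print(dict(count_disk_failed_traps(sample)))  # {'node0 Routing Engine': 2}
```
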
Cause:
  • Hard Disk Failure 

  • Transient issue with RAID volume

Solution:

To Troubleshoot

Verify RAID status by using the following commands:

jtac@root> show chassis raid members
node0:
--------------------------------------------------------------------------
/dev/sda: isw, "isw_bhbefbgdg", GROUP, ok, 468862126 sectors, data@ 0

node1:
--------------------------------------------------------------------------
/dev/sdb: isw, "isw_ciiibfcje", GROUP, ok, 468862126 sectors, data@ 0
/dev/sda: isw, "isw_ciiibfcje", GROUP, ok, 468862126 sectors, data@ 0

{primary:node0}
jtac@root> show chassis raid status
node0:
--------------------------------------------------------------------------
Raid Status: inconsistent

node1:
--------------------------------------------------------------------------
Raid Status: ok
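
In a cluster, it helps to know which node reports a status other than "ok" before starting the repair. The following is a minimal Python sketch, assuming the 'show chassis raid status' output has been captured as text. The function names and the "standalone" key used when no node header is present are hypothetical.

```python
import re

NODE_HEADER = re.compile(r"^(node\d+):\s*$")
RAID_STATUS = re.compile(r"^Raid Status:\s*(\S+)")


def raid_status_by_node(cli_output):
    """Map each node name to the RAID status it reports."""
    statuses = {}
    node = "standalone"  # hypothetical key used when no node header appears
    for raw in cli_output.splitlines():
        line = raw.strip()
        header = NODE_HEADER.match(line)
        if header:
            node = header.group(1)
            continue
        status = RAID_STATUS.match(line)
        if status:
            statuses[node] = status.group(1)
    return statuses


def nodes_needing_repair(statuses):
    """Return the nodes whose RAID status is anything other than 'ok'."""
    return sorted(name for name, state in statuses.items() if state != "ok")


sample = """node0:
--------------------------------------------------------------------------
Raid Status: inconsistent

node1:
--------------------------------------------------------------------------
Raid Status: ok
"""
statuses = raid_status_by_node(sample)
print(statuses)                        # {'node0': 'inconsistent', 'node1': 'ok'}
print(nodes_needing_repair(statuses))  # ['node0']
```
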

Perform the following steps to clear the alarm.

  1. Reboot the device.

  2. Re-partition the SRX device (not applicable to SRX4100/4200):

    a. Re-install the device with the currently running version.
    b. Run the upgrade command with the "partition" option to format and re-partition the media before installation:
> request system software add <package-name> no-validate partition

  3. Reconfigure RAID. Use one of the following two options:

Log in to the Linux host and rebuild RAID with the dmraid command (recommended):

  a. Log in to the Linux host.
start shell user root
ssh -JU __juniper_private4__ 192.168.1.1
  b. Check the RAID status and note the superset name.

root:~# dmraid -s
*** Group superset isw_bgeieiffch          

--> Active Subset
name   : isw_bgeieiffch_Volume0
size   : 445432064
stride : 128
type   : mirror
status : ok
subsets: 0
devs   : 2
spares : 0

  c. Rebuild RAID.
root:~# dmraid -R isw_bgeieiffch        <<<< Cannot be used to rebuild when RAID status is "ok"

This step can take up to 15 minutes. Until it completes, 'dmraid -s' will not show the status as ok. Use 'dmsetup status' to verify that the sync is in progress:

root:~# dmsetup status

isw_eceaedjjih_Raid1p1: 0 1984376 linear 
vg0_vjunos-lv_junos_recovery: 0 20709376 linear 
isw_eceaedjjih_Raid1: 0 445435400 mirror 2 8:16 8:0 789/3399 1 AA 1 core  <<<<<< When this completes, it should be 3399/3399
isw_eceaedjjih_Raid1p4: 0 147982424 linear 
vg0_vjunos-lv_junos: 0 41426944 linear 
isw_eceaedjjih_Raid1p3: 0 253679688 linear 
isw_eceaedjjih_Raid1p2: 0 21138672 linear 
  d. Verify the RAID status.

    root:~# dmraid -s
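
While the rebuild is running, the sync counter on the mirror line of the 'dmsetup status' output (789/3399 in the example above) can be extracted to estimate progress. This is a minimal Python sketch, assuming the raw command output (without the "<<<<" annotations) has been captured as text. The function name mirror_sync_progress is hypothetical, and the field layout is inferred from the sample line in this article.

```python
def mirror_sync_progress(dmsetup_output):
    """Return {volume: (synced, total)} for each mirror target.

    Assumed layout of a mirror line, based on the sample above:
    name: start length mirror <ndev> <dev>... <synced>/<total> ...
    """
    progress = {}
    for line in dmsetup_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[3] != "mirror":
            continue  # skip linear targets and anything unexpected
        ndev = int(fields[4])
        ratio_index = 5 + ndev  # the counter follows the device list
        if len(fields) <= ratio_index or "/" not in fields[ratio_index]:
            continue
        synced, total = fields[ratio_index].split("/")
        progress[fields[0].rstrip(":")] = (int(synced), int(total))
    return progress


sample = """isw_eceaedjjih_Raid1p1: 0 1984376 linear
isw_eceaedjjih_Raid1: 0 445435400 mirror 2 8:16 8:0 789/3399 1 AA 1 core
isw_eceaedjjih_Raid1p4: 0 147982424 linear
"""
for volume, (synced, total) in mirror_sync_progress(sample).items():
    percent = 100 * synced // total
    print(f"{volume}: {synced}/{total} ({percent}%), complete={synced == total}")
```

The rebuild can be treated as finished when both numbers are equal, which matches the "3399/3399" condition noted above.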

OR

Rebuild RAID from BIOS:

NOTE: This results in complete data loss, and a USB re-install will be required afterward.

  • Reboot the device and, during bootup, press "Ctrl + I" repeatedly.
  • In the following menu, you can delete and re-create the RAID volume, but all data will be wiped out.

               Copyright(C) 2003-15 Intel Corporation.  All Rights Reserved.
               **********************************[ MAIN MENU ]*********************************
               *         1.  Create RAID Volume             3.  Reset Disks to Non-RAID       *
               *         2.  Delete RAID Volume             4.  Mark Disks as Spare           *
               *                                            5.  Exit                          *
               ***************************[ DISK/VOLUME INFORMATION ]**************************
               * RAID Volumes:                                                                *
               * ID   Name              Level             Strip      Size Status      Bootable*
               * 0    Volume0           RAID1(Mirror)     N/A     212.4GB Normal        Yes   *
               *                                                                              *
               * Physical Devices:                                                            *
               * ID   Device Model     Serial #                     Size Type/Status(Vol ID)  *
               * 1    M500IT_MTFDDAK24 162212E665B8              223.5GB Member Disk(0)       *
               * 2    M500IT_MTFDDAK24 162212E65EF1              223.5GB Member Disk(0)       *
               *                                                                              *
               *                                                                              *
               *                                                                              *
               ********************************************************************************
               [**]-Select   [ESC]-Exit   [ENTER]-Select Menu

 

  • When the issue is present, you might not see any RAID volume, and/or the physical devices might show as 'Offline member' or 'Unknown Disk'.
  • Select option 3 (navigate with the arrow keys or the number keys) and reset both SSDs to Non-RAID.
  • Next, create a new RAID volume using option 1. Keep the new volume name as Volume0. Before moving ahead, you should see one RAID volume, and both SSDs should show 'Member Disk(0)'.
  • Next, create a bootable Junos USB using the method described here: https://kb.juniper.net/InfoCenter/index?page=content&id=KB27369
  • Plug the USB into the SRX and power cycle the device. This should take you to the Linux installation menu:
 
Copyright (C) 1994-2011 H. Peter Anvin et al
Rebooting...
Starting Linux Installation .......

|   Juniper Linux Installer - (c) Juniper Networks 2014    |
+----------------------------------------------------------+
| Reboot                                                   |
| Install Juniper Linux Platform                           |
| Boot to host shell [debug]                               |
|                                                          |
|                                                          |
|                                                          |
|                                                          |
|                                                          |
|                                                          |
|                                                          |
|                                                          |
|                                                          |
+----------------------------------------------------------+

Press [Tab] to edit options
         
  • Select the option 'Install Juniper Linux Platform' using the arrow keys. This starts the Junos installation process. After a successful install, you will see the following:
---------------------------------------------
[  OK  ] Installation for Junos
Rebooting the system to complete the installation
---------------------------------------------
Remove the USB and Press [Enter] key to Reboot...
  • If the install is unsuccessful and you end up at the 'root@(none):/tmp/root#' prompt, type 'exit'. This restarts the install process.
  4. If the above steps do not help, replace the device.

Once the alarm clears, you can verify the status again:

jtac@root> show chassis alarms no-forwarding

No alarms currently active

jtac@root> request chassis routing-engine hard-disk-test
RAID INFORMATION
RAID device path: /dev/ad4
Firmware Version: 11594
RAID controller s/n: 12345678
RAID Chip ID: 123
RAID policy: SAFE

Drive0 model: WDC WD123AAJS-4567A0
Drive1 model: WDC WD345JD-18MSA1
Drive0 s/n:      WD-WCAT30214999
Drive1 s/n:      WD-WMAM9DTK4111
Drive0 capacity: 74(GB)
Drive1 capacity: 74(GB)

RAID STATUS
Drive0: On-line
Drive1: On-line
Number of partitions: 1
Size of Partitions:
    Partition 0: 74(GB)
RAID Status: Healthy

{primary:node0}
jtac@root> show chassis raid status
node0:
--------------------------------------------------------------------------
Raid Status: ok

node1:
--------------------------------------------------------------------------
Raid Status: ok

Note: The RAID commands return meaningful output only if RAID has already been configured.

 

Modification History:
2020-07-20: Removed RE from cause. Added note to solution step 2 (NA for SRX4100/4200). Added more information to solution step 3 section c. Added steps to 'Rebuild RAID from BIOS' section of solution step 3
