[QFX/QFabric] Troubleshooting Double HDD Module Failures in a RAID Volume on a QFabric QFX3100 DG

  [KB34768]


Summary:

Each QFX3100 Director Device in a QFabric system contains two hard disk drive (HDD) modules that are organized in a RAID 1 (mirrored) configuration. If one HDD module fails or needs to be replaced, the other HDD module becomes the primary drive in the RAID.

This article focuses on a double HDD module failure, or a scenario where a customer wants to prepare a new spare Director Device chassis with two new HDD modules.

Note: Independent failure of two drives is extremely rare.

For additional examples covering a single HDD module failure, refer to Troubleshooting HDD Module Failures in a RAID Volume.

Symptoms:

If a double HDD module failure occurs in a production Director Device, the Director Device reboots and appears to be stuck in the boot process.

Cause:

A possible cause for double HDD module failure might be degraded HDD modules that are at the end of their lifetime. One of the possible use cases for wanting to prepare a spare Director Device might be replacing another Director Device that is currently in production.

In certain upgrade or migration scenarios, it might be desirable to have a spare Director Device already installed with the initial or target Junos OS version, in case something goes wrong with the upgrade or migration process. This would significantly reduce the rollback time, because no downgrade or new USB installation should be necessary. It would also significantly reduce the time needed to complete the upgrade or migration if something goes wrong during the maintenance window.

Additional use cases include, for example, when a single HDD module fails on a production Director Device and the customer prefers installing Junos OS on a spare DG in parallel, and then replacing the production DG with the new spare DG (as opposed to replacing the failed HDD module and allowing the RAID to synchronize the new drive, which can take up to 17 hours).

The USB Flash Drive creation process for system installation or recovery is very specific. We recommend using a Linux or Unix system to create the USB Flash Drive. Refer to Performing a QFabric System Recovery Installation on the Director Group for the USB Flash Drive creation process, specifically the section titled "(Optional) Creating an Emergency Boot Device Using a Juniper Networks External Blank USB Flash Drive."

Solution:

When both HDDs are removed from a QFX3100 Director Device or when both HDD modules fail in a production Director Device, the RAID becomes INACTIVE and is not recognized as storage.

When a new spare QFX3100 Director Device chassis is being prepared with two new (or old) HDD modules, the RAID is INACTIVE. When a RAID is INACTIVE, you can restore it by using an internal controller utility to activate the disk.

If a double HDD module failure occurs in a production QFX3100 Director Device, first isolate that Director Device.

 

Isolating a Director Device

Before restoring an inactive RAID or restoring a corrupted RAID, intentionally isolate the Director Device.

  1. Gracefully bring down the failing Director Device. See Powering Off a QFX3100 Director Device.
  2. Disconnect the cable in port 0 of the failing Director Device, which connects to the control plane virtual chassis.

  3. Disable the interfaces on the EX4200 or EX4300 device that connect the failing Director Device to both Interconnect devices, by using the set interfaces <ge-x/y/z> disable command.
  • On a QFX3000-M QFabric system, disable ge-0/0/40 and ge-0/0/41.

    • Copper or fiber EX Series VC0 interfaces:

      • ge-0/0/20

      • ge-0/0/21

    • Copper or fiber EX Series VC1 interfaces:

      • ge-0/0/22

      • ge-0/0/23

  • On a QFX3000-G QFabric system using copper connections, disable port 40 for Director Group 1 failures or port 41 for Director Group 2 failures.

    • Copper EX Series VC0 interfaces:

      • ge-0/0/40

      • ge-1/0/40

      • ge-2/0/40

    • Copper EX Series VC1 interfaces:

      • ge-0/0/41

      • ge-1/0/41

      • ge-2/0/41

    • Fiber EX Series VC0 interfaces:

      • ge-0/0/22

      • ge-1/0/22

      • ge-2/0/22

    • Fiber EX Series VC1 interfaces:

      • ge-0/0/23

      • ge-1/0/23

      • ge-2/0/23

The Director Device is now isolated from the rest of the QFabric system.
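
The port lists above can be turned into configuration commands mechanically. The following is a minimal sketch, not a Juniper tool: the `PORTS` table layout and the `disable_commands` helper are illustrative assumptions, and the port names are copied from the bullets above. It assumes the standard Junos `set interfaces <name> disable` configuration statement. Which ports actually apply depends on your cabling and on which Director Device failed, so review the output before committing it.

```python
import re

# Port lists copied from the article; the dictionary layout is an assumption.
PORTS = {
    ("QFX3000-M", "copper-or-fiber"): {
        "VC0": ["ge-0/0/20", "ge-0/0/21"],
        "VC1": ["ge-0/0/22", "ge-0/0/23"],
    },
    ("QFX3000-G", "copper"): {
        "VC0": ["ge-0/0/40", "ge-1/0/40", "ge-2/0/40"],
        "VC1": ["ge-0/0/41", "ge-1/0/41", "ge-2/0/41"],
    },
    ("QFX3000-G", "fiber"): {
        "VC0": ["ge-0/0/22", "ge-1/0/22", "ge-2/0/22"],
        "VC1": ["ge-0/0/23", "ge-1/0/23", "ge-2/0/23"],
    },
}

# Gigabit Ethernet interface names follow the pattern ge-<fpc>/<pic>/<port>.
_GE_NAME = re.compile(r"ge-\d+/\d+/\d+")


def disable_commands(ports):
    """Return one 'set interfaces <port> disable' line per port (hypothetical helper)."""
    commands = []
    for port in ports:
        # Reject malformed names such as 'ge0/0/41' before they reach the CLI.
        if not _GE_NAME.fullmatch(port):
            raise ValueError(f"not a valid ge interface name: {port!r}")
        commands.append(f"set interfaces {port} disable")
    return commands


if __name__ == "__main__":
    for line in disable_commands(PORTS[("QFX3000-G", "copper")]["VC0"]):
        print(line)
```

The generated lines are meant to be pasted in configuration mode on the EX Series device and followed by a commit. Removing the `disable` statement later re-enables the interface.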

 

Installing an HDD Module in a QFX3100 Director Device

Refer to Installing an HDD Module in a QFX3100 Director Device for more information.

 

Creating a new RAID

Accessing the LSI Driver Utility:

  1. Start the Director Device. A rapid succession of boot screens appears.

  2. Press Ctrl+C to interrupt the reboot sequence at the LSI BIOS screen. When Ctrl+C is pressed, the LSI BIOS utility loads.

  3. In the LSI BIOS utility, the default adapter should already be highlighted. Press Enter to open the Adapter Properties screen. You can use the arrow keys to move between the fields on the screen.

  4. Use the arrow keys to highlight RAID Properties. Press Enter to open the New Array Type Options page.

  5. Select "Create IM Volume" and press Enter to delete the existing volume and create a new volume. The new volume information appears on the Create New Array page. At this point, both HDD modules are visible but not part of the RAID.

  6. Select the first "No" field in the RAID Disk column and press the Spacebar to change the entry to Yes, which selects that disk to become part of the new RAID. The utility gives you the option to overwrite all the data on the drive or to synchronize the data with the data on the other drive.

  7. Select "D" to delete all previous data from the disk and create a new array.

  8. The WARNING screen appears. Press Enter.