tl;dr. Is there a way to properly boot a software-based RAID1 with a missing or failed drive (that wasn't failed by the user first)?
To be clear, booting a software-based RAID1 with a drive missing is possible IF you properly fail the drive before rebooting. I know this is subjective, but that doesn't seem like a plausible solution or an acceptable answer. For example: a facility takes a power hit and a hard drive dies at the same moment the power goes out. Trying to boot with a degraded array whose drive wasn't "properly" failed results in the system dropping into emergency mode.
I've read many posts here and on other forums recommending that you install grub on all drives, rebuild grub manually, add nofail to the options in /etc/fstab, or apply other seemingly simple fixes; the reality is that none of these recommendations have worked.
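For reference, this is roughly the style of /etc/fstab entry those posts suggest. The mount points match my layout below; the filesystem types and exact options here are illustrative, not a copy of my actual file:
# illustrative /etc/fstab entries with nofail added (not my exact file)
/dev/md3   /       xfs    defaults,nofail   0 0
/dev/md1   /boot   xfs    defaults,nofail   0 0
/dev/md2   /home   xfs    defaults,nofail   0 0
/dev/md4   swap    swap   defaults,nofail   0 0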
While I've come to terms with this possibly not being doable, something about it doesn't sit right. So I'm asking whether anyone else has hit this problem or has found a solution.
My environment:
I have an older motherboard that doesn't support UEFI, so I am booting in legacy/MBR mode.
OS:
cat /etc/redhat-release
Red Hat Enterprise Linux Workstation release 7.6 (Maipo)
Kernel:
uname -r
3.10.0-957.el7.x86_64
mdadm:
mdadm --version
mdadm - v4.1-rc1 2018-03-22
My RAID is RAID1 across three drives (sda, sdb, sdc), and there are 4 arrays:
md1 - /boot
md2 - /home
md3 - /
md4 - swap
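For reference, the assembled layout can be confirmed with commands like these (output omitted here, since it varies):
cat /proc/mdstat
mdadm --detail /dev/md1 /dev/md2 /dev/md3 /dev/md4
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT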
I have installed grub on all drives and ensured that every boot partition has the boot flag set.
fdisk /dev/sd[a,b,c]
(run individually) each shows a * in the Boot column next to the appropriate partition
-- and --
grub2-install /dev/sd[a,b,c]
(as separate commands, with ‘successfully installed’ results).
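Spelled out, that is three separate commands, each of which reported success:
grub2-install /dev/sda
grub2-install /dev/sdb
grub2-install /dev/sdc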
Replicating the problem:
- Power off the system with all drives assigned to the RAID and the RAID fully operational.
- Remove a hard drive
- Power the system up
Results:
The system boots past grub. Gdm attempts to display the login screen, but after about 20 seconds it fails and the system drops to an emergency console. Many pieces of a "normal" system are missing; for instance, /boot and /etc do not exist. There don't appear to be any kernel panic messages or other issues in dmesg.
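For anyone reproducing this, those observations come from checks in the emergency console along these lines (nothing exotic):
cat /proc/mdstat                    # array state
ls / /boot /etc                     # /boot and /etc are missing
dmesg | grep -i -e raid -e panic    # no md errors or panics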
Again, the key here is: the RAID has to be fully assembled, then power down and remove a drive. If you properly fail a drive and remove it from the RAID first, then you can boot without that drive present.
Example:
mdadm --manage /dev/md[1,2,3,4] --fail /dev/sda[1,2,3,4]
(as separate commands)
mdadm --manage /dev/md[1,2,3,4] --remove /dev/sda[1,2,3,4]
(as separate commands)
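Expanded for /dev/sda (assuming the partition numbers pair with the array numbers as the bracket notation above implies), that works out to:
mdadm --manage /dev/md1 --fail /dev/sda1
mdadm --manage /dev/md2 --fail /dev/sda2
mdadm --manage /dev/md3 --fail /dev/sda3
mdadm --manage /dev/md4 --fail /dev/sda4
mdadm --manage /dev/md1 --remove /dev/sda1
mdadm --manage /dev/md2 --remove /dev/sda2
mdadm --manage /dev/md3 --remove /dev/sda3
mdadm --manage /dev/md4 --remove /dev/sda4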
I know this seems trivial, but I have yet to find a viable solution to booting a system with a degraded RAID1. You would think that this should be a simple problem with a simple solution, but this does not appear to be the case.
Any help, input, or suggestions would be greatly appreciated.