Score:0

How should the EFI System partition be made redundant without using hardware RAID?

pe flag

What is the best current practice (BCP) for making the EFI System Partition redundant without using hardware RAID?

If I create 3x EFI System Partitions on different devices and then back up any changes made to the primary (mounted at /boot/efi) to the backup devices (mounted at /boot/efi-[bc]):

  • Will the system still boot if the primary device fails, i.e. will it select one of the backup EFI system partitions?
  • Will the system select an EFI System partition deterministically when it boots, i.e. must changes to the primary be replicated on the backups before the next reboot?

Is there a better approach such that the system will still boot if the primary device fails?

Score:1
za flag
  1. The UEFI specification has no concept of software RAID. This is a known deficiency.

I'd speculate this is probably because the specification was largely influenced by Microsoft, whose engineers never managed to create a reliable software RAID in Windows and may not realize that an array can be built out of partitions using a simple superblock, with no special internal structure (Windows can only build arrays out of disks converted to the "dynamic" Logical Disk Manager or Storage Spaces formats).

  2. You can create several ESPs on different devices and sync them manually.

For example, if you install Proxmox VE on ZFS "software RAID", it creates several ESPs and installs a special hook which runs after updates to the kernel, the bootloader, and other boot-related files; that hook makes sure all ESPs are kept in sync (a sketch of such a hook follows this list).

  3. For a backup ESP to take over if the primary device fails, you should set up UEFI boot entries for all of your ESPs. In Linux it's done like this:
efibootmgr -c -d /dev/sdb -l \\EFI\\DEBIAN\\GRUBX64.EFI -L debian-sdb
efibootmgr -c -d /dev/sdc -l \\EFI\\DEBIAN\\GRUBX64.EFI -L debian-sdc
efibootmgr -c -d /dev/sdd -l \\EFI\\DEBIAN\\GRUBX64.EFI -L debian-sdd
efibootmgr -c -d /dev/sda -l \\EFI\\DEBIAN\\GRUBX64.EFI -L debian-sda

This is a real example from one of the systems I manage. It assumes the ESP is the first partition of each disk (efibootmgr defaults to partition 1). Do this after you have synced the contents of your ESPs. efibootmgr -v will confirm that the boot entries created this way point at different devices.
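
Regarding point 2: a minimal sketch of such a sync hook, assuming the layout from the question (primary at /boot/efi, backups at /boot/efi-b and /boot/efi-c). Proxmox's real implementation is proxmox-boot-tool; this is only an illustration:

#!/bin/sh
# Sketch: keep backup ESPs in sync with the primary one.
# The mount points are assumptions taken from the question.
set -e
for esp in /boot/efi-b /boot/efi-c; do
    # Skip backups that are not mounted.
    mountpoint -q "$esp" || continue
    # ESPs are FAT, so skip ownership/permissions and allow for
    # FAT's 2-second timestamp granularity.
    rsync -rt --modify-window=1 --delete /boot/efi/ "$esp"/
done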

See also: https://askubuntu.com/questions/66637/can-the-efi-system-partition-be-raided

Score:0
cn flag

The firmware searches for an EFI System Partition on each configured boot device. As long as these are kept up to date, the system should still be able to boot. A distributed boot manager is, of course, a different story.

I have created a gist showing a setup in a systemd environment that syncs all these partitions on system shutdown:

https://gist.github.com/thhart/35f6e4e715c70c2cbe7c5846311d1f9f
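
For reference, the core idea looks roughly like this (a minimal sketch, not the gist itself; the mount points are the ones from the question):

[Unit]
Description=Sync backup EFI system partitions at shutdown
RequiresMountsFor=/boot/efi /boot/efi-b /boot/efi-c

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
ExecStop=/usr/bin/rsync -rt --modify-window=1 --delete /boot/efi/ /boot/efi-b/
ExecStop=/usr/bin/rsync -rt --modify-window=1 --delete /boot/efi/ /boot/efi-c/

[Install]
WantedBy=multi-user.target

Because the unit is a oneshot with RemainAfterExit=yes, it is "started" at boot and its ExecStop= commands run when systemd stops it during shutdown.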

Score:0
nc flag

The contents of the EFI partition should be relatively stable, so manually cloning changes to other copies on other disks after updates should be fine. And even if changes are not cloned, old copies might be OK as long as they are not too many revisions behind.
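
As a hedged illustration of such cloning (the device names are examples; adjust them to your layout, and run this with the backup partition unmounted):

# Clone the primary ESP (/dev/sda1 here) block-for-block onto a backup
# (/dev/sdb1 here). Note: this duplicates the FAT filesystem UUID, so
# mount the copies by device path or PARTUUID, not by filesystem UUID.
dd if=/dev/sda1 of=/dev/sdb1 bs=4M conv=fsync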

Will the system boot off of an alternate EFI partition? That's a harder question. Most modern UEFI firmware does support multiple boot entries and may try them in sequence until one works. So you just have to make sure the entries all exist and are in the correct order. You may need to run the Linux command to update the EFI boot entry list and order manually.
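
On Linux that command is efibootmgr, as shown in the answer above. A brief example (the entry numbers are system-specific placeholders):

# List the current entries and the BootOrder.
efibootmgr -v
# Put the primary ESP's entry first, followed by the backups.
efibootmgr -o 0000,0001,0002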

However, it might be better to not have it autoboot on failure. If the primary EFI disk fails, you may want to manually boot and attempt repairs anyway. But having the backup EFI even if it isn't in the boot order should make recovery a lot easier.

An alternate viewpoint -- if a disk in a RAID system is going to fail, it is likely to fail while the system is up. If you detect this condition before the next boot, you can easily activate one of your backup EFI copies (and maybe even make it the primary) until the failed disk is replaced.
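
If you take this route, efibootmgr can also arm a one-time override so that only the next boot uses the backup ESP (the entry number is again a placeholder):

# Boot entry 0002 on the next boot only (sets BootNext);
# the permanent BootOrder is left unchanged.
efibootmgr -n 0002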

fr flag
anx
*"likely to fail when the system is up"* - I suspect that empirical truth is just an artefact of failure statistics generally dealing with large buyers maximizing usage: being always up. When rebooting at a rate to make this question relevant, that might not be true.
nc flag
user10489
anx: Agree totally. Except that powered off drives that are not moving are unlikely to fail. But if you don't check them before powering off or rebooting, then this is certainly a factor.
nc flag
user10489
Put it another way: a common disk failure mode is running out of replacement sectors. Unless you bang the disk around while it's off, this is not likely to happen while it is off. So if it was running out of replacement sectors when you powered down, the stresses of powering up might make it fail, but if it was healthy when powered down, it's not going to fail at power up. This may be less true for other, less common failure modes.