I have a Hyper-V virtual environment (I know, I know) on Windows Server 2019. This environment hosts mostly Windows guests, but also has a handful of Linux machines, including two Generation 2 guests running Ubuntu 18.04 LTS.
My problem is these two guests often fail to reboot properly. When they start I can see the grub menu to select a kernel, and no matter what option I pick I'll see this (with the appropriate kernel version):
Loading Linux 4.15.0-167-generic ...
Loading initial ramdisk ...
Immediately after showing this message the VM will restart itself. I'll see this same message a few times in a loop before it gives up and just powers down completely.
I can find the echo commands in the boot script that show these messages, and I added an additional "Ramdisk loaded ..." message after the initrd command to confirm that it completes. I do also see this message.
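For reference, the edited menu entry looks roughly like the sketch below. This is an approximation of the usual Ubuntu-generated grub.cfg layout rather than a copy of my file: the menu entry title, the /boot paths, and the `<root-uuid>` placeholder are assumptions, and only the two existing echo messages and the added third one are taken from what I actually see on screen.

```
menuentry 'Ubuntu, with Linux 4.15.0-167-generic' {
    # ... insmod / search lines omitted ...
    echo 'Loading Linux 4.15.0-167-generic ...'
    # <root-uuid> is a placeholder, not the real value
    linux /boot/vmlinuz-4.15.0-167-generic root=UUID=<root-uuid> ro
    echo 'Loading initial ramdisk ...'
    initrd /boot/initrd.img-4.15.0-167-generic
    # Added line: printed only after the initrd command returns
    echo 'Ramdisk loaded ...'
}
```

Since the added message does appear, GRUB seems to finish loading both the kernel and the ramdisk, and the reset happens at or after the hand-off to the kernel.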
Here's the kicker: if I keep trying, eventually the machine will succeed and boot properly. Sometimes it can take anywhere from dozens to a couple hundred retries, but so far they have always eventually booted. This has been going on for some time now. Each time it happens I try to research what's going on, but I haven't been able to find any errors, and the machine boots before I get far enough to find anything helpful.
One confounding factor in all this is that I'm not typically looking to reboot the machine in the first place unless I've also done an apt upgrade that's likely to include an updated kernel.
What could be going on here? What kind of race condition could cause this while still allowing the boot process to eventually succeed?