I run a hypervisor with multiple PCI devices that are passed through to the hosted virtual machines.
Recently one of the physical cards stopped responding, and this seems to have caused many parts of the system to hang, including the shutdown and reboot commands.
Concern: To recover from this fault, I had to force a hard power-off by holding the power button. The device was fine after the reboot.
Question: I also run one of these systems at a customer location. How can I ensure that I can still reboot the host when a failure like this occurs?
The messages below were the only ones observed in dmesg while the system was in this faulted state.
[388365.317349] pcieport 0000:b8:09.0: not ready 1023ms after bus reset; waiting
[388366.405341] pcieport 0000:b8:09.0: not ready 2047ms after bus reset; waiting
[388368.517337] pcieport 0000:b8:09.0: not ready 4095ms after bus reset; waiting
[388372.741060] pcieport 0000:b8:09.0: not ready 8191ms after bus reset; waiting
[388381.445281] pcieport 0000:b8:09.0: not ready 16383ms after bus reset; waiting
[388398.341198] pcieport 0000:b8:09.0: not ready 32767ms after bus reset; waiting
[388434.180784] pcieport 0000:b8:09.0: not ready 65535ms after bus reset; giving up
[388436.357023] pcieport 0000:b8:09.0: not ready 1023ms after bus reset; waiting
[388437.445019] pcieport 0000:b8:09.0: not ready 2047ms after bus reset; waiting
[388439.556755] pcieport 0000:b8:09.0: not ready 4095ms after bus reset; waiting
[388443.908994] pcieport 0000:b8:09.0: not ready 8191ms after bus reset; waiting
[388452.612953] pcieport 0000:b8:09.0: not ready 16383ms after bus reset; waiting
[388469.508875] pcieport 0000:b8:09.0: not ready 32767ms after bus reset; waiting
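To make the pattern in these messages easier to see, here is a minimal sketch that groups the "not ready" lines into reset attempts. It assumes the message format is exactly as shown above. The function name `summarize` and the regular expression are my own illustration, not part of any kernel tooling.

```python
import re

# Matches lines such as:
# [388365.317349] pcieport 0000:b8:09.0: not ready 1023ms after bus reset; waiting
# The pattern is an assumption based only on the messages shown above.
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\S+\s+(?P<dev>\S+): "
    r"not ready (?P<ms>\d+)ms after (?P<kind>.+?); (?P<state>waiting|giving up)"
)


def summarize(lines):
    """Group 'not ready' messages into reset attempts (hypothetical helper)."""
    attempts = []
    current = None
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # ignore unrelated dmesg lines
        ms = int(m.group("ms"))
        ts = float(m.group("ts"))
        dev = m.group("dev")
        # A new attempt starts when the device changes or the reported
        # delay drops back down instead of continuing to grow.
        if current is None or current["dev"] != dev or ms <= current["last_ms"]:
            current = {"dev": dev, "first_ts": ts, "last_ts": ts,
                       "last_ms": ms, "messages": 0, "gave_up": False}
            attempts.append(current)
        current["last_ts"] = ts
        current["last_ms"] = ms
        current["messages"] += 1
        if m.group("state") == "giving up":
            current["gave_up"] = True
    return attempts
```

Run against the output above (for example the lines of `dmesg` saved to a file), this reports two attempts on 0000:b8:09.0: the first ends in "giving up" at 65535ms, and the second begins about two seconds later and was still waiting at 32767ms when the log was captured.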