I am having an argument with one of our suppliers about a "vanished" NVMe.
The server has two NVMe drives running in RAID-1 via mdadm.
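For anyone who wants the exact layout, this is how the mirror can be inspected (md0 is the array from the logs below; output omitted here):

```
# Show the overall md state and the members of the mirror
cat /proc/mdstat
mdadm --detail /dev/md0
```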
syslog:
Nov 20 01:31:21 server kernel: [4638997.424557] md/raid1:md0: Disk failure on nvme1n1p1, disabling device.
Nov 20 01:31:21 server kernel: [4638997.424557] md/raid1:md0: Operation continuing on 1 devices.
Nov 20 01:31:21 server udisksd[2123]: Unable to resolve /sys/devices/virtual/block/md0/md/dev-nvme1n1p1/block symlink
Nov 20 01:31:21 server udisksd[2123]: Unable to resolve /sys/devices/virtual/block/md0/md/dev-nvme1n1p1/block symlink
To me, this looks primarily like a hardware issue.
What I checked afterwards:

- /proc/mdstat: only one NVMe listed
- lspci: only one NVMe listed
- nvme list: only one NVMe listed
- rebooted the server and went into the BIOS: only one NVMe listed

I then shut the server down and simply reseated the "vanished" NVMe in the same slot. After that, the BIOS, lspci, nvme list, etc. all showed two NVMe drives again, and the RAID rebuild ran fine (rebuild sketch below).
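In case it matters, the rebuild itself was nothing special; if a kicked member does not rejoin on its own, it is roughly this (a sketch, device names as in the logs above):

```
# If the failed member is still listed, drop it first, then re-add it
mdadm --manage /dev/md0 --remove /dev/nvme1n1p1   # only if still shown as failed
mdadm --manage /dev/md0 --add /dev/nvme1n1p1
# Watch the resync progress
watch -n 5 cat /proc/mdstat
```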
Conversation with supplier:
- Me: What could be the cause? Maybe hardware?
- Supplier: Software RAID is dumb; it can cause errors like this.
- Me: I don't get it. I can't imagine that a software RAID is able to make devices vanish even from the BIOS.
- Supplier: The driver can set a "disable flag" in the BIOS on BMC-based systems, so that specific devices are disabled completely.
At the time we had just installed Ubuntu 18.04 without any additional drivers, etc.
Both lspci -v and lsmod just show "nvme", which I believe is the stock module shipped with Ubuntu's kernel.
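This is how I checked the driver binding (the grep pattern is just an example; adjust it to your lspci output):

```
# Show which kernel driver/module is bound to the NVMe controllers
lspci -nnk | grep -i -A3 'non-volatile memory'
# Confirm which nvme modules are loaded
lsmod | grep nvme
```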
Now my question to you: can a kernel module (in this case the nvme driver) disable devices in the BIOS on BMC-based systems, so that they vanish completely?
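For reference, the only module-level knobs I know how to look at are whatever modinfo reports for the in-tree driver; I don't see anything there that looks like a firmware-level disable, but maybe I'm looking in the wrong place:

```
# List the parameters the stock NVMe modules expose
modinfo nvme
modinfo nvme_core
```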