I recently updated our cluster to VMware ESXi 7.0 Update 3.
A couple of days later, a virtual machine started freezing at random.
No message is shown on the screen. In the VM events I see these messages:
In(05) vcpu-0 - NVME-VMM: Controller level reset via CC.EN bit transition on nvme0
In(05) vcpu-0 - NVME-CORE: Doing a partial reset of controller regs and queues.
In(05) vcpu-1 - NVME-VMK: nvme0:0: Ignoring completions [ignoreCmp=0].
In(05) vcpu-8 - NVME-VMM: Unexpected CQ#8 doorbell write: prevHead=46, newHead=47, size=256, inflight=0
In(05) vcpu-0 - Vix: [vmxCommands.c:7182]: VMAutomation_HandleCLIHLTEvent. Do nothing.
In(05) vcpu-0 - MsgHint: msg.monitorevent.halt
In(05)+ vcpu-0 - The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.
Inside the virtual machine I don't see any errors reported,
except for the following, which appears at around the time the machine locks up:
kernel: [28667.084637] nvme nvme0: I/O 197 QID 14 timeout, aborting
kernel: [28667.084716] nvme nvme0: Abort status: 0x0
kernel: [28697.292556] nvme nvme0: I/O 197 QID 14 timeout, reset controller
kernel: [28697.356676] nvme nvme0: 15/0/0 default/read/poll queues
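
For anyone who wants to check their own guests for the same pattern, here is a minimal sketch that pulls the NVMe timeout events out of kernel log lines like the ones above. The regular expression is written only against the line format shown in this post, and the function name `parse_nvme_timeouts` and the `SAMPLE` list are my own illustrative choices, not part of any tool.

```python
import re

# Assumption: lines look like the guest kernel messages quoted in this post, e.g.
#   kernel: [28667.084637] nvme nvme0: I/O 197 QID 14 timeout, aborting
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvme\s+(?P<ctrl>nvme\d+):\s+"
    r"I/O\s+(?P<tag>\d+)\s+QID\s+(?P<qid>\d+)\s+timeout,\s+(?P<action>.+)$"
)


def parse_nvme_timeouts(lines):
    """Return one dict per 'I/O ... timeout' line; other lines are skipped."""
    events = []
    for line in lines:
        match = TIMEOUT_RE.search(line.rstrip())
        if match:
            events.append({
                "uptime_s": float(match.group("ts")),   # seconds since guest boot
                "controller": match.group("ctrl"),
                "tag": int(match.group("tag")),         # command tag of the stuck I/O
                "qid": int(match.group("qid")),         # queue the command was on
                "action": match.group("action"),        # what the driver did next
            })
    return events


# The four guest kernel lines from this post, used as sample input.
SAMPLE = [
    "kernel: [28667.084637] nvme nvme0: I/O 197 QID 14 timeout, aborting",
    "kernel: [28667.084716] nvme nvme0: Abort status: 0x0",
    "kernel: [28697.292556] nvme nvme0: I/O 197 QID 14 timeout, reset controller",
    "kernel: [28697.356676] nvme nvme0: 15/0/0 default/read/poll queues",
]

if __name__ == "__main__":
    for event in parse_nvme_timeouts(SAMPLE):
        print(event)
```

On the sample lines this finds two events for the same command (I/O 197 on queue 14): first an abort, then a controller reset roughly 30 seconds later. That gap matches the Linux NVMe driver's default I/O timeout of 30 seconds, so the guest appears to have waited one full timeout period for the abort to take effect before resetting the controller.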
The virtual machine is configured with a virtual NVMe controller, and its virtual disk is placed on a volume mapped to NVMe storage over NVMe over Fibre Channel.
After downgrading ESXi back to 7.0 Update 2d, the issues are gone.
VMware tells me it could be related to a kernel bug.
What could be the issue?