High CPU usage and stability issues during Live Migration

I have been looking into an issue and am struggling to get a definitive answer or solution to a problem.

During live migrations of VMs between two hosts, the host receiving the VM sees a single CPU core spike to 100%, and performance and stability are affected. For example, Task Manager is slow to respond, freezes/stutters, and loses data to display on its graphs for the duration of the live migrations. Live Migration speeds max out at 6-7 Gbps. The sending server sees an increase in CPU core usage, but that is spread over 2-3 cores at no more than 50% each.

We have enabled vRSS and VMMQ and set the number of available queues following various guides available on the internet. I can share those settings if desired. I understand that when using LBFO you cannot enable VMMQ (VMMQEnabledRequested = True, but VMMQEnabled = False), so I set one host to use a SET switch instead, with no change or improvement.
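For anyone wanting to compare against their own hosts, here is a minimal sketch of read-only commands that should show the state described above. It assumes the in-box NetAdapter, NetLbfo and Hyper-V PowerShell modules on the host; the wildcard property filters are only a convenience, and the exact properties returned may vary by OS build and NIC driver.

```powershell
# Read-only queries; nothing here changes configuration.

# VMQ state per physical adapter (base processor, max processors, queue count)
Get-NetAdapterVmq

# RSS state per physical adapter
Get-NetAdapterRss

# LBFO team definition (teaming mode and load-balancing algorithm)
Get-NetLbfoTeam

# Virtual switches, including whether embedded teaming (SET) is in use
Get-VMSwitch | Format-List Name, SwitchType, EmbeddedTeamingEnabled

# Requested vs. effective VMMQ / vRSS values on the host vNICs
Get-VMNetworkAdapter -ManagementOS | Format-List Name, *Vmmq*, *Vrss*
```

Comparing the requested and effective values on the last command is where the LBFO behaviour mentioned above (requested True, enabled False) would show up.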

We use Windows Server 2016 Core edition with just the Hyper-V roles running, and have no other agents or applications installed – this is a vanilla setup. This also occurs on all of our clusters (which are identical).

The VMQ settings are set to avoid core 0, and we normally see core 4, 6 or 8 only, hitting 100% – i.e. NEVER core 0, and never on cores up to 16 (single proc) or 32 (dual proc).

We are using 2 x 10GbE ports on a dual-NIC Intel card (single PCI card), which are in a SIT LBFO team set to Hyper-V rather than Dynamic (although that setting makes no difference).

Networking is defined using SCVMM, and the hosts are using the SCVMM Virtual Switch for the dedicated Live Migration network.

We are currently using SMB for Live Migrations because we can limit SMB throughput to keep below the 100% CPU limit, but this issue occurs regardless of using TCP/IP, Compression, or SMB (although compression utilises the CPU for a much shorter period). NOTE: SMB throttling is disabled for my testing.

The key issue we want to resolve is that the VMMS service sometimes hangs / locks during host drain events. E.g. if we perform CAU and each host is drained in turn, we sometimes get a failure because a host fails to drain all VMs. In that scenario the problem server shows the live migrations “stuck” at 3% (in FCM), and you cannot migrate or restart the VMs (they shut down and never come back up). Most Hyper-V related tools also stop working (e.g. Get-VM just hangs and never responds). The ONLY fix is to hard reset the host (shutdown/restart fails to complete). We cannot find the cause of this, and the only symptoms we see are the host stability issues noted above.

Please let me know what information you need to help advise on this issue.
