Score:3

Azure Windows Server 2019 Datacenter failover cluster unresponsive after reboot only when shared disk is assigned as a CSV


We're seeing some really strange behaviour with a two-node failover cluster on Azure running Windows Server 2019 Datacenter. We didn't see the issue right away, but at some point it started happening and now we can reproduce it at will.

We have an Azure shared disk that we assign as a cluster shared volume (CSV) in Failover Cluster Manager. If we reboot one of the nodes, Windows Explorer becomes unresponsive for quite some time once the node starts up again. Interestingly, PowerShell is also unresponsive (we can't even type a command into it) until Windows Explorer becomes responsive. We launched PowerShell from Task Manager. However, launching a Command Prompt window from Task Manager does not have the delay.

We've removed all roles from the cluster, removed the software that was installed, and formatted the CSV drive so it was all clean.

If we remove the disk as a CSV, leave it in Available Storage and reboot, we do not get the delay. If we add it back as a CSV, we get the delay again. We can repeat this as often as we want.

If we reboot both nodes at once, it takes up to 45 minutes for Explorer and PowerShell to become responsive again. Doing the same with no CSV, we have no issue.

I can't see anything in the event logs that indicates the problem. It's a really strange phenomenon.

I'd say it was a one-off, but we had this issue previously and decided to just redeploy from scratch. Everything worked fine for a day or two and then it started again.

We're pretty much at the end of the things we can try, and I'm wondering if anyone has seen anything similar or if there is anything else we can look at.

Score:5

That is a known problem that our customers stumbled upon as well. Recommendations from Microsoft support have been the following:

  • Check that you are using a premium SSD as the shared disk;
  • Make sure the maxShares parameter matches the number of cluster nodes, so that the disk is shareable across all FCI nodes.
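
For anyone who wants to rule out those two points quickly, here is a minimal sketch of the check. It assumes you have already exported the disk description as JSON (for example from `az disk show`) and parsed it into a dict; the `sku.name` and `maxShares` field names reflect what that output normally contains, while the function name `check_shared_disk` and the sample disk are made up for illustration.

```python
from typing import Any


def check_shared_disk(disk: dict[str, Any], node_count: int) -> list[str]:
    """Return a list of findings for one disk description; empty means both checks pass."""
    findings = []

    # Assumption: the SKU name starts with "Premium" for premium SSDs (e.g. "Premium_LRS").
    sku_name = (disk.get("sku") or {}).get("name", "")
    if not sku_name.startswith("Premium"):
        findings.append(f"SKU is {sku_name or 'unknown'}, not a premium SSD")

    # Assumption: a missing or null maxShares means the disk is not shared (treated as 1).
    max_shares = disk.get("maxShares") or 1
    if max_shares < node_count:
        findings.append(f"maxShares is {max_shares} but the cluster has {node_count} nodes")

    return findings


# Hypothetical disk description for a two-node cluster.
example_disk = {"name": "shared-disk-example", "sku": {"name": "Premium_LRS"}, "maxShares": 2}
print(check_shared_disk(example_disk, node_count=2))
```

An empty list only means the disk matches the two recommendations; it says nothing about whether the delay will go away.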

Neither of those recommendations worked for us. Redeploying the cluster from scratch temporarily fixes the problem, but as you have noticed, it comes back sooner or later.

A practical workaround for this problem is to use Storage Spaces Direct or Virtual SAN software, which essentially replicates the storage between both Azure virtual machines and allows you to build a Microsoft Failover Cluster on top of it. An additional virtual machine acting as an iSCSI target server is also a valid option.

Bee_Riii
Thanks for confirming it's not just us! We're using premium SSD and maxShares is set correctly.