Score:1

Fixing bad network drivers on 2019 DC S2D installation

I am helping to troubleshoot an existing Windows failover cluster on Windows Server 2019 Datacenter. The cluster is set up to use Storage Spaces Direct (http://aka.ms/s2d), but there are problems with the deployment; specifically, the network driver versions are mismatched on the NIC ports used by S2D. To get the cluster back to a healthy state, I will take the cluster down and move all network ports on all nodes to the same driver version. The current plan is to run the NIC driver install after the cluster is offline, then simultaneously reboot all nodes and bring the cluster back online.

Here is the rub: there is apparently NO official documentation on how to do this safely, other than doing an in-place update/upgrade of the driver software. Removing the device entries and then re-installing the new driver software appears to be "a bit heavy-handed," since it also entails a complete reset of the networking stack via some fairly obscure PowerShell commands.

The Question (in three pieces):

Is there a known best practice for updating NIC drivers on a failover cluster running S2D (not to be confused with HDD, SSD, or NVMe drivers) that would guarantee a clean driver update without disturbing the cluster?

Is my "in place" method sufficient, and will work as intended?

Or is this cluster simply done and over and probably needs a destroy-and-rebuild?

Score:2

Many months later, I have the answer. Specifically, to answer each part of my question:

  1. There really isn't a known method or document to point to at this time (that I could find).
  2. The in-place method is feasible, but has caveats.
  3. You do not need to destroy the cluster, although if you are careless, you could easily do so.

You'll need administrative PowerShell access, and you'll have both hands dirty up to the wrists before this is done. This is a dangerous procedure. You run the risk of crippling or even destroying your cluster if you do this incorrectly. I cannot speak to every installation; I can only describe how I resolved the issue. If you are unsure about the steps below, or don't have the drivers, or have any other reason that could cause an issue, do not proceed. If things break, you keep both pieces and I am not to blame. You have been duly warned of the danger of data loss.

For my specific installation, the problem was with the specific driver for the NICs themselves. After much research - including a deep and revealing reading of the vendor errata for the NIC drivers - I was able to determine the correct version level to use. After uninstalling all network driver versions and/or rolling back to the specific version needed, the NICs behaved themselves, and once the rectification process was complete, have since remained well-behaved. Unfortunately this requires taking the nodes offline one at a time, but as long as you are patient, this is doable.
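
For reference, this is roughly how I would take a driver inventory across the cluster before deciding which version to standardize on - a minimal sketch, assuming PowerShell remoting is enabled and the FailoverClusters module is available on the node you run it from; adjust the filtering to match your own hardware.

    # Inventory physical NIC driver versions on every cluster node.
    # Mismatches show up when sorting by adapter model and driver version.
    $nodes = (Get-ClusterNode).Name
    Invoke-Command -ComputerName $nodes -ScriptBlock {
        Get-NetAdapter -Physical |
            Select-Object Name, InterfaceDescription, DriverProvider,
                          DriverVersion, DriverDate
    } |
        Sort-Object InterfaceDescription, DriverVersion |
        Format-Table PSComputerName, Name, InterfaceDescription,
                     DriverVersion, DriverDate -AutoSize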

  1. Ensure you have all of the prerequisite driver(s) available before beginning, and copy the driver(s) to your destination node so they are locally available. Make sure the data on the cluster (and specifically the cluster shared volumes) has been backed up, cloned, or moved to a safe location. Recognize that you will degrade any cluster shared volumes in use, along with any hyperconverged VMs. Be willing and able to evict a node from the cluster should things go sideways. These are all serious prerequisites and you should ensure you can do any/all of them as needed. Do not proceed until you have considered all of these at a minimum.
  2. For the given node that requires NIC driver updates, drain all active roles and prepare to pause it, as if starting a maintenance or patching cycle. (PowerShell sketches for this and the following steps appear after the list.)
  3. Once the drain/pause is complete, disable any Hyper-V switches that are "wrapping" the NICs, should you have this common arrangement. Some deployments use this arrangement to "fuse" or "bind" the NICs into a unified interface. Note that some clusters will not have this; proceed to the next step if that is the case.
  4. Disable the physical NICs in Windows that are attached to the private fabric switch. At this point, no traffic should be passing between the cluster and the node.
  5. Rolling back the drivers may remove DCB and/or QoS settings for your NIC, depending on the vendor's installation process. While unlikely, there is still a possibility that the next step will cross a point of no return for the node if this is the case. Be sure you are ready to deal with the possibility that the node may not be able to rejoin the cluster because of this activity.
  6. Uninstall and/or roll back the drivers on the NIC(s) assigned to the private fabric switch. It's OK if they fall back to the stock drivers downloaded from Microsoft, since you can replace them with the drivers you staged earlier on the local boot drive.
  7. If the stock NIC drivers from Microsoft are not the version you require, or if the drivers are only available from the vendor, install the vendor drivers for the NIC. Depending on the driver software and your circumstances, you may be required to reboot the node - hopefully not, but be prepared if that is the case.
  8. Examine the driver revision level for each NIC attached to the private fabric switch. They must be the version you require. If not, take whatever remedial action is necessary to get the correct drivers into place.
  9. Ensure that all of the correct settings are present on the host: the Data Center Bridging feature is enabled in Windows; all private fabric NIC ports have RDMA enabled; DCBX acceptance for those same NIC ports is disabled (that is correct, not a typo - it is the -Willing parameter of Set-NetQosDcbxSetting); there are no QoS policies that conflict with what you need; you have a "Cluster" QoS policy, an "SMB" policy, and an "SMB Direct" policy; you have ETS traffic classes for "Cluster" and "SMB Direct" (note that "SMB" is covered by "SMB Direct" when using port 445); and flow control is assigned for the Cluster (7) and SMB (3) priorities respectively. See the DCB/QoS sketch after this list.
  10. Once all host NIC settings are confirmed, enable QoS for each NIC via the Enable-NetAdapterQos PowerShell cmdlet.
  11. Re-enable the physical NIC ports to regain access to the private fabric switch.
  12. If using Hyper-V wrapper switches, re-enable the corresponding Hyper-V "NICs".
  13. Open Performance Monitor (the old, creaky one that everyone forgets about) and add some counters for RDMA traffic. With the NICs re-enabled, you should start to see a trickle of consistent data coming in, since the cluster continues to silently shuttle S2D traffic even with a node in a paused state. Also open Event Viewer and look through any clustering logs you can find for evidence of data exchange with other nodes.
  14. If you are satisfied that the node is functional, resume its operation. You don't have to fail roles back to the original node; that's entirely up to you and your circumstances.
  15. Once the cluster shared volumes are more or less stable, you can proceed to repeat this process for each node that requires it. You will want to ensure that your CSVs are healthy and stable before attempting this on another node. Do not rush the process.
  16. Once all nodes are stable, ensure that the entire cluster is returned to service.
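
The sketches below show one way to express the steps above in PowerShell. They are illustrative only: the node, adapter, and vNIC names (NODE01, "SLOT 3 Port 1", "vEthernet (SMB1)", and so on) are placeholders for whatever your deployment actually uses, so check every name and value against your own fabric before running anything. First, steps 2 through 4 (drain, pause, and quiesce the private fabric ports):

    # Step 2: drain roles and pause the node, as in a normal patching cycle.
    Suspend-ClusterNode -Name "NODE01" -Drain -Wait

    # Step 3: if the S2D ports sit under a Hyper-V switch, take the host vNICs
    # down first. There is no cmdlet that disables a vSwitch itself; disabling
    # its vEthernet adapters achieves the same quiescing effect.
    Disable-NetAdapter -Name "vEthernet (SMB1)", "vEthernet (SMB2)" -Confirm:$false

    # Step 4: take down the physical ports attached to the private fabric switch.
    Disable-NetAdapter -Name "SLOT 3 Port 1", "SLOT 3 Port 2" -Confirm:$false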
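
Steps 6 through 8, assuming the replacement driver package was staged locally. The .inf name and C:\Drivers path are placeholders; use pnputil /enum-drivers to find the real published name (oemNN.inf) of the package you need to remove.

    # Step 6: remove the unwanted driver package; Windows falls back to the
    # stock in-box driver for the device.
    pnputil.exe /enum-drivers
    pnputil.exe /delete-driver oem42.inf /uninstall

    # Step 7: install the staged vendor driver package.
    pnputil.exe /add-driver "C:\Drivers\NIC\*.inf" /install

    # Step 8: confirm every private fabric port reports the required version.
    Get-NetAdapter -Name "SLOT 3 Port *" |
        Select-Object Name, InterfaceDescription, DriverVersion, DriverDate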
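
Steps 9 and 10, expressed with the standard DCB/QoS cmdlets. The priorities (Cluster on 7, SMB/SMB Direct on 3) follow the description above; the bandwidth percentages are placeholders and must match your own fabric design.

    # Step 9: host-side DCB/QoS state.
    Install-WindowsFeature -Name Data-Center-Bridging

    # RDMA on for the private fabric ports; DCBX "willing" off so the host,
    # not the switch, owns the QoS settings.
    Enable-NetAdapterRdma -Name "SLOT 3 Port 1", "SLOT 3 Port 2"
    Set-NetQosDcbxSetting -Willing $false -Confirm:$false

    # QoS policies: cluster heartbeat on priority 7, SMB/SMB Direct on priority 3.
    New-NetQosPolicy "Cluster"    -Cluster                         -PriorityValue8021Action 7
    New-NetQosPolicy "SMB"        -SMB                             -PriorityValue8021Action 3
    New-NetQosPolicy "SMB Direct" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

    # ETS traffic classes for Cluster and SMB Direct, plus priority flow control.
    New-NetQosTrafficClass "Cluster"    -Priority 7 -BandwidthPercentage 1  -Algorithm ETS
    New-NetQosTrafficClass "SMB Direct" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS
    Enable-NetQosFlowControl -Priority 3, 7

    # Step 10: re-enable QoS at the adapter level.
    Enable-NetAdapterQos -Name "SLOT 3 Port 1", "SLOT 3 Port 2"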
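
Finally, steps 11 through 16: bring the ports back, confirm RDMA traffic is flowing (the same "RDMA Activity" counters Performance Monitor shows can be sampled from PowerShell, assuming your adapters expose that counter set), resume the node, and watch the storage repair jobs before moving on to the next node.

    # Steps 11-12: re-enable the physical ports, then the Hyper-V vNICs.
    Enable-NetAdapter -Name "SLOT 3 Port 1", "SLOT 3 Port 2"
    Enable-NetAdapter -Name "vEthernet (SMB1)", "vEthernet (SMB2)"

    # Step 13: non-zero values here indicate the paused node is exchanging
    # S2D traffic with the rest of the cluster again.
    Get-Counter -Counter "\RDMA Activity(*)\RDMA Inbound Bytes/sec",
                         "\RDMA Activity(*)\RDMA Outbound Bytes/sec" -SampleInterval 5 -MaxSamples 3

    # Step 14: resume the node; failing roles back is optional.
    Resume-ClusterNode -Name "NODE01" -Failback NoFailback

    # Steps 15-16: let storage repair jobs finish and confirm the CSVs and
    # virtual disks are healthy before touching the next node.
    Get-StorageJob
    Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus
    Get-ClusterSharedVolume | Select-Object Name, State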