Many months later, I have the answer. Specifically, to answer each part of my question:
- There really isn't a known method or document to point to at this time (that I could find).
- The in-place method is feasible, but has caveats.
- You do not need to destroy the cluster, although if you are careless, you could easily do so.
You'll need administrative PowerShell access, and you'll have both hands dirty up to the wrists to get this done. This is a dangerous procedure: you run the risk of crippling or even destroying your cluster if you do it incorrectly. I cannot speak to every installation; I can only describe how I resolved the issue. If you are unsure about the steps below, don't have the drivers, or have any other reason to expect trouble, do not proceed. If things break, you keep both pieces and I am not to blame. You have been duly warned of the danger of data loss.
For my installation, the problem was the specific driver version for the NICs themselves. After much research, including a deep and revealing reading of the vendor errata for the NIC drivers, I was able to determine the correct version level to use. After uninstalling all network driver versions and/or rolling back to the specific version needed, the NICs behaved themselves, and once the rectification process was complete they have remained well-behaved. Unfortunately this requires taking the nodes offline one at a time, but as long as you are patient, it is doable.
- Before beginning, confirm all of the prerequisites: the required driver(s) are available and have been copied to the destination node so they are locally accessible; your data on the cluster (and specifically the cluster shared volumes) has been backed up, cloned, or moved to a safe location; you recognize that you will degrade any cluster shared volumes in use, along with any hyperconverged VMs; and you are willing and able to evict a node from the cluster should things go sideways. These are all serious prerequisites, and you should ensure you can do any or all of them as you decide. Do not proceed until you have considered all of these at a minimum.
- For the node that requires NIC driver updates, drain all active roles and pause it, as if you were starting a maintenance or patching cycle.
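  A minimal sketch of that drain, assuming the failover clustering cmdlets are available and using NODE2 as a placeholder node name:

  ```powershell
  # Drain all roles off the node and pause it (NODE2 is a placeholder name)
  Suspend-ClusterNode -Name "NODE2" -Drain

  # Confirm the node shows as Paused and the drain has completed
  Get-ClusterNode -Name "NODE2" | Select-Object Name, State, DrainStatus
  ```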
- Once the drain/pause is complete, disable any Hyper-V switches that are "wrapping" the NICs (should you have this common arrangement). Some deployments use this arrangement to "fuse" or "bind" the NICs into a unified interface. Note that some clusters will not have this; if that is the case, skip to the next step.
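  As a sketch, assuming a SET-style switch whose host-facing adapters show up as "vEthernet (...)"; the adapter name below is a placeholder:

  ```powershell
  # Find the host-side virtual adapters created by the Hyper-V switch
  Get-NetAdapter | Where-Object { $_.Name -like "vEthernet*" }

  # Disable the one(s) bound to the private fabric ("vEthernet (Fabric)" is a placeholder name)
  Disable-NetAdapter -Name "vEthernet (Fabric)" -Confirm:$false
  ```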
- Disable the physical NICs in Windows that are attached to the private fabric switch. At this point, no traffic should be passing between the cluster and the node.
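  For example (the fabric NIC names here are placeholders for whatever your ports are called):

  ```powershell
  # Disable the physical ports attached to the private fabric switch
  Disable-NetAdapter -Name "Fabric-NIC1","Fabric-NIC2" -Confirm:$false

  # Verify they now report as Disabled
  Get-NetAdapter -Name "Fabric-NIC1","Fabric-NIC2" | Select-Object Name, Status
  ```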
- Rolling back the drivers may remove DCB and/or QoS settings for your NIC, depending on the vendor's installation process. While unlikely, there is still a possibility that the next step will cross a point of no return for the node if this is the case. Be sure you are ready to deal with the possibility that the node may not be able to rejoin the cluster due to this activity.
- Uninstall and/or roll back the drivers on the NIC(s) assigned to the private fabric switch. It's OK if they fall back to the stock drivers downloaded from Microsoft, since you can replace them with the drivers you staged earlier on the local boot drive.
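  One way to do the removal is with pnputil from an elevated prompt; the exact flags available depend on your Windows Server build, and oem42.inf is a placeholder for whatever your enumeration shows:

  ```powershell
  # List third-party driver packages and note the "Published Name" (oemNN.inf) of the NIC driver
  pnputil /enum-drivers

  # Remove that driver package and uninstall it from the devices using it
  pnputil /delete-driver oem42.inf /uninstall /force
  ```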
- If the stock NIC drivers from Microsoft are not the correct version you require, or if the correct drivers are only available from the vendor, install the drivers you staged for the NIC. Depending on the driver software and your circumstances, you may be required to reboot the node; hopefully not, but be prepared if that is the case.
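  A sketch of that install, again via pnputil; the path to the staged vendor INF is a placeholder:

  ```powershell
  # Stage and install the vendor driver copied to the node earlier
  pnputil /add-driver "C:\Drivers\NIC\vendor_nic.inf" /install
  ```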
- Examine the driver revision level for each NIC attached to the private fabric switch. They must be the version you require. If not, take whatever remedial action is necessary to get the correct drivers into place.
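  One quick way to check (adapter names are placeholders):

  ```powershell
  # Confirm the driver now reported for each fabric NIC
  Get-NetAdapter -Name "Fabric-NIC1","Fabric-NIC2" |
      Format-List Name, InterfaceDescription, DriverProvider, DriverVersionString, DriverDate
  ```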
- Ensure that all of the correct settings are present on the host. Specifically:
  - the Data Center Bridging feature is enabled in Windows;
  - RDMA is enabled on all private fabric NIC ports;
  - DCBX acceptance is disabled on those same NIC ports (that is correct, not a typo: the -Willing parameter of Set-NetQosDcbxSetting must be set to $false);
  - there are no QoS policies that conflict with what you need;
  - you have a "Cluster" QoS policy, an "SMB" policy, and an "SMB Direct" policy;
  - you create ETS traffic classes for "Cluster" and "SMB Direct" (note that "SMB" will be covered by "SMB Direct" when using port 445);
  - and you assign NetQoS flow control to the Cluster (7) and SMB (3) priorities respectively.
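  A minimal sketch of that host configuration, assuming the 3-for-SMB / 7-for-Cluster priority split described above; the adapter names, policy names, and bandwidth percentages are placeholders to adapt to your own deployment:

  ```powershell
  # Data Center Bridging feature on the host
  Install-WindowsFeature -Name Data-Center-Bridging

  # RDMA on every private fabric NIC port
  Enable-NetAdapterRdma -Name "Fabric-NIC1","Fabric-NIC2"

  # Do not accept DCBX configuration pushed from the switch
  Set-NetQosDcbxSetting -Willing $false

  # QoS policies: cluster heartbeat at priority 7, SMB / SMB Direct (port 445) at priority 3
  New-NetQosPolicy -Name "Cluster" -Cluster -PriorityValue8021Action 7
  New-NetQosPolicy -Name "SMB" -SMB -PriorityValue8021Action 3
  New-NetQosPolicy -Name "SMB Direct" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

  # ETS traffic classes for Cluster and SMB Direct (bandwidth percentages are examples only)
  New-NetQosTrafficClass -Name "Cluster" -Priority 7 -BandwidthPercentage 1 -Algorithm ETS
  New-NetQosTrafficClass -Name "SMB Direct" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS

  # Priority flow control for the Cluster (7) and SMB (3) priorities
  Enable-NetQosFlowControl -Priority 3,7
  ```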
- Once all host NIC settings are confirmed, enable QoS for each NIC via the Enable-NetAdapterQos PowerShell cmdlet.
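  For example (NIC names are placeholders):

  ```powershell
  # Enable DCB/QoS on each private fabric NIC
  Enable-NetAdapterQos -Name "Fabric-NIC1","Fabric-NIC2"

  # Confirm the traffic classes the adapters now report
  Get-NetAdapterQos -Name "Fabric-NIC1","Fabric-NIC2"
  ```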
- Re-enable the physical NIC ports to regain access to the private fabric switch.
- If using Hyper-V wrapper switches, re-enable the corresponding Hyper-V "NICs".
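  A sketch of those two re-enable steps (all names are placeholders):

  ```powershell
  # Bring the physical fabric ports back first, then the Hyper-V host vNIC(s)
  Enable-NetAdapter -Name "Fabric-NIC1","Fabric-NIC2"
  Enable-NetAdapter -Name "vEthernet (Fabric)"

  # Verify everything comes back Up
  Get-NetAdapter | Select-Object Name, Status, LinkSpeed
  ```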
- Open the performance monitor (the old, creaky one that everyone forgets about) and add some counters for RDMA traffic. With the NICs re-enabled, you should start to see a trickle of consistent data coming in, since the cluster continues to silently shuttle S2D traffic even with a node in a paused state. Also open the Event Viewer and poke around any clustering logs you can find, looking for evidence of data exchange with the other nodes.
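  If you prefer to stay in PowerShell, something like the following can stand in for the Performance Monitor and Event Viewer checks; the RDMA Activity counter set is only present when the NIC driver exposes it:

  ```powershell
  # Counter paths for RDMA traffic
  $rdmaCounters = "\RDMA Activity(*)\RDMA Inbound Bytes/sec",
                  "\RDMA Activity(*)\RDMA Outbound Bytes/sec"

  # Sample for a minute: 5-second interval, 12 samples
  Get-Counter -Counter $rdmaCounters -SampleInterval 5 -MaxSamples 12

  # Skim recent failover clustering events for evidence of communication with the other nodes
  Get-WinEvent -LogName "Microsoft-Windows-FailoverClustering/Operational" -MaxEvents 50 |
      Select-Object TimeCreated, Id, LevelDisplayName, Message
  ```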
- If you are satisfied that the node is functional, resume its operation. You don't have to fail roles back to the original node; that is entirely up to you and your circumstances.
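  For example (NODE2 is a placeholder; pick whichever -Failback behaviour suits you):

  ```powershell
  # Resume the node and let roles fail back immediately...
  Resume-ClusterNode -Name "NODE2" -Failback Immediate

  # ...or resume it without moving any roles back
  Resume-ClusterNode -Name "NODE2" -Failback NoFailback
  ```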
- Once the cluster shared volumes are more or less stable, you can proceed to repeat this process for each node that requires it. You will want to ensure that your CSVs are healthy and stable before attempting this on another node. Do not rush the process.
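  A few checks that can help confirm the storage has settled before you move on, assuming a Storage Spaces Direct deployment:

  ```powershell
  # Any outstanding repair/resync jobs should finish before you touch the next node
  Get-StorageJob

  # The virtual disks and CSVs should report healthy and online
  Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus
  Get-ClusterSharedVolume | Select-Object Name, State, OwnerNode
  ```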
- Once all nodes are stable, ensure that the entire cluster is returned to service.