Score:5

Building a low-cost, high-availability cluster using Windows Failover Cluster or Proxmox

jp flag

I need to build a high-availability cluster (key functionality, SLA of almost 24/7) for virtual machines: AD DC, FS, WSUS, print server, one Oracle database, and a few Linux VMs (not important to the business). Performance is not that important, but everything of course needs to work well.

Things I have:

  • for now one physical site (data center)
  • license for Windows Server 2019 Standard
  • ability to install Proxmox and buying support for Proxmox
  • two non-identical Lenovo servers with a support contract (one has 16 cores, the other 20; one has 9 x 279 GB drives, the other 3 x 279 GB; both can use RAID 5)
  • two stacked 1 Gb switches
  • a Synology with 2 power supplies and a 4 x 1 Gb Ethernet card

Things I can buy:

  • a pro storage array
  • an HBA card for the Lenovo servers

The initial idea is to build a Windows high-availability cluster connected over iSCSI to the Synology (or a new storage array) for virtual servers that do not need fast read/write storage bandwidth (2 nodes and one storage device, so 1 point of failure).

I've read some articles about Storage Replica. Can I build a cluster with 2 nodes and one storage array for the virtual machines that do not require performance, and use the Storage Replica mechanism (on the same nodes) on volumes on the nodes' local disks for the virtual machines that need more performance?

EDIT (more info):

Can I have a VM with one of the two DC servers on the Windows Failover Cluster? The second DC VM (holding all the AD master roles) will be on another server that is not part of the cluster.

I have 2 VMs with DCs for AD, but I need a failover solution for the other services/servers.

There is an option called Cluster-Aware Updating in Windows failover clustering. Does it not work as I assume (the name seems self-explanatory)?

The Recovery Time Objective and Recovery Point Objective are not so strict. The business will survive a 1-hour gap for less critical services and a 15-minute gap for mission-critical services.
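
To make those targets concrete, here is a minimal Python sketch (not part of the original question) that checks a candidate design against them. It assumes the stated "gap" applies to both data loss (RPO) and outage time (RTO); the replication interval and failover time passed in are hypothetical numbers, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Objective:
    """Tolerated gap in minutes, taken from the figures in the question."""
    rpo_min: float  # maximum tolerated data loss
    rto_min: float  # maximum tolerated outage


# Assumption: the stated "gap" applies to both RPO and RTO.
MISSION_CRITICAL = Objective(rpo_min=15, rto_min=15)
LESS_CRITICAL = Objective(rpo_min=60, rto_min=60)


def meets_objective(obj, replication_interval_min, failover_min):
    """Return True if the design stays within the objective.

    Worst-case data loss is approximated by one full replication interval
    (0 for synchronous replication); worst-case outage by the failover time.
    """
    return (replication_interval_min <= obj.rpo_min
            and failover_min <= obj.rto_min)


# Hypothetical design: asynchronous replication every 30 minutes,
# 10 minutes to bring the VMs up on the other node.
print(meets_objective(LESS_CRITICAL, 30, 10))     # True
print(meets_objective(MISSION_CRITICAL, 30, 10))  # False
```

With these made-up numbers the less critical services would be fine, while the mission-critical ones would need a shorter replication interval or synchronous storage.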

cn flag
Active Directory does not support Windows Failover Clusters.
vidarlo avatar
ar flag
Have you considered making the services redundant on service level, by e.g. hosting two DC's?
Zac67 avatar
ru flag
Windows requires you to install updates - on the machine level you won't reach .99999 and need to create redundancy on the service level. Also, the network, storage (Synology sounds like single controller), power and UPS also need to be redundant.
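
For a rough feel of what ".99999" means, here is a small Python sketch of the usual availability arithmetic. It is only an illustration: the component availabilities are hypothetical, and the formulas assume failures are independent, which real hardware sharing power or switches does not guarantee.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability):
    """Allowed downtime per year for a given availability (e.g. 0.99999)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def series_availability(*components):
    """Availability of a chain where every component must be up."""
    total = 1.0
    for availability in components:
        total *= availability
    return total


def redundant_availability(single, copies=2):
    """Availability of independent copies where one working copy is enough."""
    return 1.0 - (1.0 - single) ** copies


# Five nines leaves about 5.26 minutes of downtime per year,
# less than a single patch-and-reboot cycle on one machine.
print(round(downtime_budget_minutes(0.99999), 2))  # 5.26

# Hypothetical chain: server, switch and single-controller storage at 99.9% each.
print(series_availability(0.999, 0.999, 0.999))

# Two redundant instances of a hypothetical 99.9% service.
print(redundant_availability(0.999))
```

The chain is always worse than its weakest link, while redundancy at the service level multiplies the failure probabilities instead.
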
cn flag
Two factors missing are Recovery Time Objective and Recovery Point Objective. Storage Replica will not have the same failover time as normal shared storage. Also, since you are asking about domain controllers: those should not be anywhere near this setup. Active Directory DCs absolutely should be able to survive on their own, independent of unproven and problematic technologies that will delay recovery. When Storage Replica collapses and needs to be restored, it helps to be able to authenticate and get basic dial tone recovery while the collapsed storage is restored.
Score:7
kz flag

You can absolutely use SR (Storage Replica) to build a “poor man’s” Windows Server Failover Cluster (WSFC). See the example below where guys used SR to cluster SMB3 file service.

https://www.starwindsoftware.com/blog/part-1-storage-replica-with-failover-cluster-and-file-server-role-windows-server-technical-preview

This is how the process can be guided. As much as it can be, of course.

https://www.virtualizationhowto.com/2019/11/windows-server-2019-storage-replica-failover-process/

The problem is that SR was intended to be used as a DR (Disaster Recovery) solution, not as an HA (High Availability) one. SR is not very flexible, needs some babysitting, is quite seldom used, and requires Datacenter edition (except for the anemic version included in Windows Server Standard, limited to a single volume of up to 2 TB). Not recommended. This is what Microsoft has to say.

https://learn.microsoft.com/en-us/windows-server/storage/storage-replica/storage-replica-overview

Bottom line… If you have already paid for Datacenter you can try S2D (Storage Spaces Direct), which is for brave people only as it’s still not reliable, or you can use StarWind Virtual SAN (vSAN), which is free and can be used not only with Standard but also with the free Hyper-V Server.

https://www.starwindsoftware.com/starwind-virtual-san-free

eKKiM avatar
lr flag
Can you elaborate on why S2D is "only for the brave"?
BaronSamedi1958 avatar
kz flag
https://storagespaceswarstories.com/category/stories/
BaronSamedi1958 avatar
kz flag
https://www.reddit.com/r/sysadmin/comments/ah07ri/my_review_after_a_year_of_storage_spaces_direct/
BaronSamedi1958 avatar
kz flag
https://www.reddit.com/r/sysadmin/comments/609e98/another_catastrophic_failure_on_our_windows/
cn flag
@eKKiM: S2D fundamentally makes zero sense for most organizations because storage is simpler and less expensive than it ever has been. S2D exists because it is what Microsoft uses internally and at Azure. That doesn't mean it's a good fit for anyone else.
Score:4
cn flag

Domain controllers use their own replication mechanism and don’t need any shared storage.

https://learn.microsoft.com/en-us/windows-server/identity/ad-ds/get-started/replication/active-directory-replication-concepts

https://www.manageengine.com/products/active-directory-audit/kb/how-to/how-to-check-if-domain-controllers-are-in-sync-with-each-other.html

File servers can be made HA with the help of DFS-R. It’s not perfect, and it comes with quite a few limitations (no true transparent failover, a read penalty, and no split-brain protection), but it works.

https://learn.microsoft.com/en-us/windows-server/storage/dfs-replication/dfsr-overview

Oracle has its own database (DB) replication, similar to MS SQL Server Always On Availability Groups (AGs).

https://www.arcion.io/learn/oracle-replication#:~:text=Oracle%20replication%20allows%20users%20to,reporting%2C%20testing%2C%20and%20backups.

In a nutshell: rethink what you’re doing. It could be that you’re overthinking and over-engineering the whole thing.

Score:3
hu flag

You should avoid Synology in production. It has a single controller, so during firmware updates, reboots or any other issues your whole cluster will be down. Classic SPOF.

https://en.m.wikipedia.org/wiki/Single_point_of_failure

sokar avatar
jp flag
Can you explain what the difference is (when a 24/7 SLA is important) between a Synology device with 2 PSUs and LACP and a pro storage array? I have never used a pro storage array.
El Marinero  avatar
hu flag
Synology won’t have high uptime because of the single-controller design. Also, Synology support isn’t enterprise level: Taiwanese business hours.
Score:0
ws flag

If you want to use Proxmox for high availability then you need a cluster with three nodes (or, more generally, an odd number: 2n+1), but the additional node can be VERY basic: it is just there as an observer to arbitrate on split-brain decisions. Note that Proxmox HA provides a means to spin up missing VMs when a cluster node goes offline. You don't need that for MS-AD; just make sure the existing domain controllers are distributed across separate hardware.
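
The reason for the odd node count is plain majority voting. A minimal Python sketch of that arithmetic, assuming one vote per node (this is an illustration of the rule, not Proxmox code):

```python
def votes_needed(total_nodes):
    """Smallest strict majority of the configured votes (one vote per node assumed)."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes, nodes_online):
    """True if the surviving nodes still form a majority."""
    return nodes_online >= votes_needed(total_nodes)


# Two nodes: losing one leaves 1 of 2 votes, which is not a majority.
print(has_quorum(2, 1))  # False

# Three nodes (two hypervisors plus a basic observer): losing one leaves 2 of 3.
print(has_quorum(3, 2))  # True
```

With only two voters, neither side can tell a dead peer from a broken link, which is why the third, very basic node is enough to break the tie.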

Do make sure you have enough resources (storage, CPU, memory) to run the critical VMs and any residual VMs on a surviving physical node; there's no mechanism in Proxmox (currently) to shut down non-critical VMs to free up capacity for the critical ones.
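
That capacity check is easy to do on paper, or with a few lines of Python like the sketch below. Only the core counts (16 and 20) come from the question; the memory, storage and VM sizes are hypothetical placeholders, and the core check is deliberately strict (no CPU overcommit).

```python
from dataclasses import dataclass


@dataclass
class Capacity:
    cores: int
    memory_gb: int
    storage_gb: int


def fits(node, vms):
    """True if the node alone can hold every VM in the list."""
    return (
        sum(vm.cores for vm in vms) <= node.cores
        and sum(vm.memory_gb for vm in vms) <= node.memory_gb
        and sum(vm.storage_gb for vm in vms) <= node.storage_gb
    )


def two_node_failover_ok(node_a, node_b, vms):
    """In a two-node setup, whichever node survives must hold all VMs."""
    return fits(node_a, vms) and fits(node_b, vms)


# Hypothetical figures apart from the core counts.
node_a = Capacity(cores=16, memory_gb=128, storage_gb=2232)
node_b = Capacity(cores=20, memory_gb=128, storage_gb=558)
vms = [
    Capacity(cores=2, memory_gb=8, storage_gb=80),
    Capacity(cores=4, memory_gb=16, storage_gb=300),
    Capacity(cores=8, memory_gb=64, storage_gb=320),
]

print(fits(node_a, vms))                          # True
print(fits(node_b, vms))                          # False (not enough local storage)
print(two_node_failover_ok(node_a, node_b, vms))  # False
```

In this made-up case the smaller disk set is the limit, so the VMs could only fail over in one direction.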

Your Synology box is a SPOF, so it's not suitable as the primary storage for your HA VMs. It is a good place to keep backups, but you really want to use PBS for backups and that does NOT like latency in storage access. ISTR there are Docker images available with PBS, if your Synology supports Docker.

The setup is a bit small for Ceph-based storage, but is there a reason you don't want to use the built-in ZFS replication or GlusterFS on the hypervisor to ensure that the disk images are in sync?

While generally stuff like replication and HA is smoother when handled nearer the top of the application stack, it does mean having different implementations for each component - trying to use MS-Storage-Replica for your Oracle database is not likely to be a pleasant experience. I would definitely recommend using Proxmox / ZFS replication as the default then add exceptions where that is not a good fit.

RiGiD5 avatar
cn flag
ZFS replication is DR; it’s not an HA solution. It won’t help the OP at all.
ws flag
The mechanisms you listed give no better guarantees of consistency. And with the exception of active directory don't solve the service availability issue without additional components.
RiGiD5 avatar
cn flag
What are you talking about?! AD replication is basically built-in, it’s a 100% Microsoft thing. Dude, there are hundreds of millions of replicated ADs deployed worldwide.