Score:3

Howto: Block or File replication across 3+ nodes without a SAN

th flag

The setup

I admin the backend for a website that currently runs on a single node using Nginx (web server), Neo4j (database) and WildFly (app server). The website is getting enough traffic that we are both storage- and memory-limited on the current 'all-in-one' node, so I provisioned two more VPS nodes (3 in total) that will only run WildFly.

I've successfully configured Nginx to use the 'hash' load-balancing method across the 3 nodes, keyed on a user ID contained in the request URI, so that each user is consistently routed to the same WildFly node and its cache stays warm.
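
For reference, the relevant part of the Nginx config looks roughly like this (the upstream name, node addresses and port are illustrative, and it assumes the user ID is exposed as a query parameter; a path component could equally be captured into a variable with a map or a regex location):

```
upstream wildfly_backend {
    # Consistent hashing on the user ID keeps each user pinned to one node
    # and minimises re-shuffling when a node is added or removed.
    hash $arg_user_id consistent;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://wildfly_backend;
    }
}
```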

Each of the 3 nodes has its own 150 GB high-availability block storage device (maintained by the VPS provider), mounted as a single /images directory that the WildFly app on that node reads image files from and writes them to.

Update

The image files should be write-once/read-many (at least in the nominal case): new images get created all the time, but existing images rarely get updated. Additionally, because of Nginx's hash load-balancing, each WildFly node should already have all the images it needs for the clients routed to it. The need for replication is really twofold:

  1. It makes adding or removing Wildfly nodes transparent as each node has all the data from the other nodes
  2. It makes backing up easier as everything is consolidated in one place

Additionally, each of the VPS nodes is part of a private gigabit VLAN that the VPS provider enables for all nodes in the same datacenter (which all of my nodes are). The replication traffic will traverse this link.

The Problem

Because the app is now distributed, I want each of the /images directories across the 3 nodes to be fully replicated. Although Nginx's 'hash' load-balancing ensures consistent node usage on a per-user basis, I want the contents of the /images directory to be a union of all three nodes in case one of the nodes goes down and users need to be redistributed across the other available nodes.

The Question

What is the best way to address the problem above? From my understanding, rsync is not the appropriate tool for this job. There is this Server Fault question which is similar in nature, but it's 12 years old and I'm sure there have been advances in data replication since then.

In my research, I came across GlusterFS, which looks promising, but it's unclear how to set it up to address my problem. Would I make the high-availability block storage device on each node a single 'brick' and then combine those into a single Gluster volume? I presume I would then create the /images directory on this Gluster volume and mount it on each of the nodes via the native FUSE client? My gut says this is not correct, because each node would simultaneously be both a server (contributing a brick) and a client (reading/writing the Gluster volume), which seems unconventional.

BaronSamedi1958
kz flag
Do you have a beefy, low-latency connection between the cluster nodes?
th flag
The VPS provider gives every node in the same datacenter (which all my nodes are in) a private network. Running `iperf` between two nodes suggests it is a gigabit VLAN; not sure if that qualifies as "beefy."
BaronSamedi1958
kz flag
1Gb connectivity should be enough for virtual SAN solutions!
Score:1
ws flag

The SAN model presupposes a highly available block storage service. You could implement the same thing at the file level, but that would mean adding more nodes (or putting additional load on your existing hosts), and making NFS highly available is a bit tricky.

Another option for block-level replication is DRBD. But with conventional filesystems it's not a good idea to have the filesystem mounted by more than one host, so it has to be used in combination with, e.g., GFS2, and this is still rather complex and esoteric. Combined with HTTP caching on the reverse proxy, you could have the cache as the preferred location, the "primary" web server next, and local storage as a third fallback, meaning you still handle most reads from the local filesystem and only have a replication-lag issue when a node is down.
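
To give a flavour of what the DRBD + GFS2 route involves, a rough sketch (the resource name "images", device and cluster names are purely illustrative; it assumes the resource is already defined under /etc/drbd.d/ with allow-two-primaries set, and that a corosync/dlm cluster is running for the lock manager):

```
# On each node, after defining the DRBD resource in /etc/drbd.d/:
drbdadm create-md images          # initialise DRBD metadata
drbdadm up images                 # bring the resource up

# Once only, on the first node, to kick off the initial sync:
drbdadm primary --force images

# A cluster filesystem is needed because several nodes mount it at once:
mkfs.gfs2 -p lock_dlm -t mycluster:images -j 3 /dev/drbd0
mount /dev/drbd0 /images
```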

Then there are filesystems that replicate: GlusterFS is probably the best choice here. Your interpretation of how it works is accurate, but your concern is misplaced; nodes acting as both servers and clients is exactly how GlusterFS is expected to be used.
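
For a 3-node replicated volume the setup is roughly as follows (hostnames and brick paths are illustrative; it assumes the provider block storage is mounted at /data/brick on each node, with the Gluster volume then mounted at /images):

```
# On every node: create a brick directory on the provider block storage
mkdir -p /data/brick/images

# From one node: form the trusted pool
gluster peer probe node2
gluster peer probe node3

# Create and start a replica-3 volume, one brick per node
gluster volume create images replica 3 \
    node1:/data/brick/images node2:/data/brick/images node3:/data/brick/images
gluster volume start images

# On every node: mount the volume via the native FUSE client
mount -t glusterfs localhost:/images /images
```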

You mention VPS: the hypervisor may already provide a mechanism for sharing a block device across multiple hosts (e.g. io2 on AWS, shared volumes and directory storage on Proxmox), but you would still need a cluster filesystem (GFS2) on top.

A quick mention here for ZFS replication - which is great but only really works between 2 nodes.

But really your choice depends on two things you have not addressed in your question: how quickly do files change, and how are they changed? Maybe all you need is something like lsyncd (there are links to other solutions in its documentation), or perhaps even plain rsync.
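
If the files mostly just accumulate, something as simple as this may be enough (hostnames illustrative; one lsyncd instance per target is shown for brevity, though a config file with one sync{} block per target is the more usual approach):

```
# Watch /images with inotify and push changes to the other nodes over SSH
lsyncd -rsyncssh /images node2 /images
lsyncd -rsyncssh /images node3 /images
```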

th flag
Appreciate the answer. My VPS provider (Vultr) only allows block storage to be attached to a single host. They do provide S3-compatible object storage but explicitly say it cannot be used as a block device or mounted filesystem. I've updated my question to address your last two important questions. With those updates, would you say `lsyncd` is still a potential solution? It appears to use `rsync` under the hood, which I thought was limited to only 2 nodes, and whatever solution I use, I do want it to scale easily as we grow.
ws flag
If you've got limited options on the infrastructure side and the content is only ever added to, then the master-node / HTTP-caching / rsync-or-lsyncd approach looks like a good solution. I don't understand the issue with rsync: yes, you do need to rsync to each non-master node. The tricky part is routing file writes to the master node and delegating that role if the current master is offline.
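
i.e. on the master node, something as small as this run from cron (hostnames illustrative):

```
#!/bin/sh
# Push new images from the master to the other nodes.
# --ignore-existing is safe if images are write-once/read-many.
for host in node2 node3; do
    rsync -az --ignore-existing /images/ "$host":/images/
done
```
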
ws flag
OTOH, with only 3 nodes it would be quite possible for each node to mount the relevant directory from both of the other two (via sshfs if NFS isn't available or can't be secured) and use rsync to replicate locally on each node. Gets messy if you want to scale this up, though.
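
Roughly like this on each node (hostnames and mount points illustrative):

```
# Mount /images from the other two nodes read-only via sshfs,
# then pull anything not already present into the local copy.
mkdir -p /mnt/node2-images /mnt/node3-images
sshfs -o ro,reconnect node2:/images /mnt/node2-images
sshfs -o ro,reconnect node3:/images /mnt/node3-images

rsync -a --ignore-existing /mnt/node2-images/ /images/
rsync -a --ignore-existing /mnt/node3-images/ /images/
```
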
th flag
Thank you. With your guidance I was able to set up `glusterfs` and use all 3 nodes as both servers (hosting one brick each) and clients of the Gluster volume -- it works great and scales to additional nodes too!