How to fix MARS (Multiversion Asynchronous Replicated Storage) timeout when viewing status?

Question

Score:2

Server

davidgo

6/1/24, 8:25 PM

I'm running MARS (Multiversion Asynchronous Replicated Storage) between 2 remote locations, and for the most part it is running acceptably for my needs.

As part of monitoring I periodically issue the command:

 marsadm view maildata

Initially this came back in a few seconds, but after a couple of months it now takes close to 5 minutes to come back and tell me everything is OK. The delay has been growing gradually, and it appears to be a function of the time since replication was set up rather than anything to do with load or time of day. The problem occurs on both the primary and the secondary server.

I tried adding --timeout=90, but this does not seem to make any difference.
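To put numbers on the slowdown, the command can be wrapped in a small timing helper and the result logged from the monitoring job. This is only a sketch: `elapsed_seconds` is a hypothetical helper name, and it discards the command's output, so it measures duration only.

```shell
#!/bin/sh
# Hypothetical helper: run a command and print how many whole seconds it took.
# The command's own output is discarded; only the duration is reported.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example use (assumes marsadm is on PATH and the resource is named maildata):
#   elapsed_seconds marsadm view maildata
```

Logging this value alongside a timestamp would show whether the delay really grows steadily with time rather than with load.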

The servers have a very low (on the order of 1 megabit) but continuous volume of data flowing across a 200ms latency connection which has more than 10 megabit of bandwidth available to it. The load on my servers hovers around "1" on CPUs with 8 or more cores.

/mars shows 1% space utilization on both sides. The volume I am mirroring is less than 1 TB in size and sits on SSDs.

I manually ran marsadm cron on both servers and then reissued the command. It made no discernible difference.

The status of my primary looks like

LocalDevice /dev/mars/maildata [Opened, 4 IOPS]
maildata [2/2] UpToDate Replicating DCASFR Primary marsvmserver1

And similarly, on my secondary

  maildata [2/2] UpToDate Replaying dCASFR Secondary marsvmserver1

I don't really know where to look, but I have not found any log entries indicating any kind of problem.

I compiled a kernel from source based on 5.4.20, and can see that the mars module is loaded.

How can I modify things so that I can get the marsadm status information in a few seconds (or at least under 2 minutes)?

Update

(Still don't know what's happening)

/mars/resource-maildata has a large number of empty symlink files in the format:

version-000005492-marsvmserver2 -> b2ae552972debb9835fe7510ef059430,log-000005492-marsvmserver1,23019520:247d0e2c5721f303cc46da3ba7ab51e5,log-000005491-marsvmserver1,18510436

After taking adequate precautions with my data I attempted to remove these files, but they immediately reappeared. (They also existed on the secondary).
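A read-only way to keep an eye on how many of these symlinks exist is to count them. The sketch below assumes the `version-*` naming shown above and the `/mars/resource-maildata` path from this setup; `count_version_links` is a hypothetical helper name, and it changes nothing on disk.

```shell
#!/bin/sh
# Hypothetical helper: count the version-* symlinks directly inside a
# resource directory. Read-only; nothing is modified.
count_version_links() {
    dir=${1:-/mars/resource-maildata}
    find "$dir" -maxdepth 1 -type l -name 'version-*' | wc -l | tr -d ' '
}

# Example use:
#   count_version_links /mars/resource-maildata
```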

Interestingly, I ran "marsadm invalidate" on the secondary and this appears to be doing something: revalidating looks like it is going to take several hours. The number of symlink files has dropped by about half but is gradually increasing again. The time to do a marsadm view maildata has also about halved.

Update 2 (very partial fix)

No doubt this is very wrong and dangerous to my data. I'm not entirely happy with this, but it is progress.

I'd be grateful if anyone has a correct/better solution to my problem.

After the data resynced I was still left with over 5000 version files - it appears that the version files from the secondary node were gone, but the ones from the primary node were retained.
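To check which node's version files remain, the symlinks can be grouped by the hostname at the end of their names. This is a sketch that assumes the `version-<number>-<host>` naming seen in this directory; `version_links_per_host` is a hypothetical helper name, and it only reads the directory.

```shell
#!/bin/sh
# Hypothetical helper: print "<host> <count>" for the version-* symlinks in a
# resource directory, using the host suffix of each link name. Read-only.
version_links_per_host() {
    dir=${1:-/mars/resource-maildata}
    find "$dir" -maxdepth 1 -type l -name 'version-*' |
        sed 's|.*/version-[0-9]*-||' |   # strip path, prefix and sequence number
        sort | uniq -c |
        awk '{print $2, $1}'             # reorder to "host count"
}

# Example use:
#   version_links_per_host /mars/resource-maildata
```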

I played around with marsadm log-purge-all but this did not do anything.

In desperation I tried the following, which has greatly sped up the response (to a couple of seconds), but I do not know what the side effects are:

I ensured that the disks were pretty much in-sync.

On the SECONDARY I issued "rmmod mars" and then proceeded to delete the version files older than a few days. They were not automatically recreated. I then deleted the same files on the primary, and then issued a "modprobe mars" on the secondary. After a few seconds of saying it was outdated and sync'ing on the secondary it became the primary.
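Before deleting anything, it helps to see exactly which files an age cutoff would select. The sketch below is a dry run only: it lists candidates and removes nothing. `list_old_version_links` is a hypothetical helper name, the cutoff in days is an assumption to adjust, and whether removing these files is safe at all remains an open question.

```shell
#!/bin/sh
# Hypothetical helper (dry run): list version-* symlinks in a resource
# directory whose own modification time is older than the given number of
# days. It only prints names; it does not delete anything.
list_old_version_links() {
    dir=$1
    days=$2
    # Without -L, find examines the symlink itself rather than its target.
    find "$dir" -maxdepth 1 -type l -name 'version-*' -mtime +"$days" -print
}

# Example use:
#   list_old_version_links /mars/resource-maildata 3
```

Saving this list on both nodes and comparing the two would confirm that the same set of files is being considered on each side.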

I then wrote more data on my primary, created an LVM snapshot of the disk underlying the MARS replication on the secondary, mounted that snapshot and saw that the new data I wrote on the primary appeared on the secondary. (I then unmounted and destroyed the snapshot)

I note that new version files were being added every 10 minutes. I checked, and I was running "marsadm cron" every 10 minutes. I changed this to run every 1 minute, and sure enough the rate at which these files were created increased to 1 per minute, so the problem likely relates to the marsadm cron job.
