
How to fix MARS (Multiversion Asynchronous Replicated Storage) timeout when viewing status?


I'm running MARS (Multiversion Asynchronous Replicated Storage) between 2 remote locations, and for the most part it is running acceptably for my needs.

As part of monitoring, I periodically issue the command

 marsadm view maildata

Initially this came back in a few seconds, but after a couple of months it now takes close to 5 minutes to come back and tell me everything is OK. This timeframe has gradually been getting longer, and it appears to be a function of time since the replication was set up rather than anything to do with load or time of day. The problem occurs on both the primary and the secondary server.

I tried adding --timeout=90, but this does not seem to make any difference.
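For reference, this is roughly how I invoke and time the check (I may have the option placement slightly off, and the 90-second value is just what I happened to pick):

    # time the status check, with the timeout flag I tried
    time marsadm --timeout=90 view maildata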

The servers have a very low (on the order of 1 megabit) but continuous volume of data flowing across a 200 ms latency connection which has more than 10 megabits of bandwidth available to it. The load on my servers hovers around "1" on CPUs with 8 or more cores.

/mars shows 1% space utilization on both sides. The volume I am mirroring is less than 1 TB in size and sits on SSDs.
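For completeness, that space figure comes from a plain df check of the MARS metadata filesystem on each node:

    # space usage of /mars on both primary and secondary
    df -h /mars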

I manually ran marsadm cron on both servers and then reissued the command. It made no discernible difference.
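To be explicit, this is all I did on each node (run as root):

    # manually trigger MARS housekeeping on both primary and secondary
    marsadm cron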

The status of my primary looks like

LocalDevice /dev/mars/maildata [Opened, 4 IOPS]
maildata [2/2] UpToDate Replicating DCASFR Primary marsvmserver1 

And similarly, on my secondary

  maildata [2/2] UpToDate Replaying dCASFR Secondary marsvmserver1 

I don't really know where to look, but I have not found any log entries indicating any kind of problem.

I compiled a kernel from source based on 5.4.20, and can see that the mars module is loaded.
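This is simply how I confirm the module is present:

    # check that the mars kernel module is loaded
    lsmod | grep -w mars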

How can I modify things so that I can get the marsadm status information in a few seconds (or at least under 2 minutes)?

Update

(Still don't know what's happening)

/mars/resource-maildata has a large number of empty symlink files in the format:

version-000005492-marsvmserver2 -> b2ae552972debb9835fe7510ef059430,log-000005492-marsvmserver1,23019520:247d0e2c5721f303cc46da3ba7ab51e5,log-000005491-marsvmserver1,18510436
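This is just how I counted them; the directory matches my resource name:

    # count the version-* symlinks accumulating in the resource directory
    find /mars/resource-maildata -maxdepth 1 -name 'version-*' -type l | wc -l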

After taking adequate precautions with my data I attempted to remove these files, but they immediately reappeared. (They also existed on the secondary).

Interestingly, I ran a "marsadm invalidate" on the secondary and this appears to be doing something - revalidating looks like it is going to take several hours. The number of symlink files has dropped by about half but is gradually increasing. The time to run marsadm view maildata has also roughly halved.
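The command I ran on the secondary was simply (resource name as in my setup):

    # mark the secondary's replica invalid and trigger a full revalidation
    marsadm invalidate maildata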

Update 2 (very partial fix)

No doubt this is very wrong and dangerous to my data. I'm not entirely happy with this - but it is progress.

I'd be grateful if anyone has a correct/better solution to my problem.

After the data resynced I was still left with over 5000 version files - it appears that the version files from the secondary node were gone, but the ones from the primary node were retained.

I played around with marsadm log-purge-all but this did not do anything.
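The invocation I tried was along these lines (I may not have had the arguments right, which could be why it did nothing):

    # attempt to purge obsolete transaction logfiles for the resource
    marsadm log-purge-all maildata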

In desperation I tried the following, which has greatly sped up the response (to a couple of seconds), but I do not know what the side effects are:

I ensured that the disks were pretty much in-sync.

On the SECONDARY I issued "rmmod mars" and then proceeded to delete the version files older than a few days. They were not automatically recreated. I then deleted the same files on the primary, and then issued a "modprobe mars" on the secondary. After a few seconds of saying it was outdated and syncing on the secondary, it became the primary.
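In shell terms the sequence was roughly this (the 3-day cutoff is only an approximation of "older than a few days"; please do not treat this as a safe or recommended procedure):

    # on the SECONDARY: unload MARS so it stops recreating the symlinks
    rmmod mars
    # delete version symlinks older than a few days (cutoff chosen arbitrarily)
    find /mars/resource-maildata -maxdepth 1 -name 'version-*' -type l -mtime +3 -delete

    # on the PRIMARY: delete the same old version symlinks
    find /mars/resource-maildata -maxdepth 1 -name 'version-*' -type l -mtime +3 -delete

    # back on the SECONDARY: reload the module and let it catch up
    modprobe mars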

I then wrote more data on my primary, created an LVM snapshot of the disk underlying the MARS replication on the secondary, mounted that snapshot and saw that the new data I wrote on the primary appeared on the secondary. (I then unmounted and destroyed the snapshot)
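The check on the secondary went roughly like this (volume group, LV name and mount point are placeholders for my setup):

    # snapshot the LV that backs the MARS resource on the secondary
    lvcreate -s -n maildata_snap -L 10G /dev/vg0/maildata
    mount /dev/vg0/maildata_snap /mnt/snap
    # ... confirm the data newly written on the primary is present ...
    umount /mnt/snap
    lvremove -f /dev/vg0/maildata_snap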

I note that new version files were being added every 10 minutes. I checked and found I was running "marsadm cron" every 10 minutes; I changed this to run every minute, and sure enough the rate at which these files were created increased to one per minute - so the problem likely relates to the marsadm cron job.
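The relevant cron entry now looks something like this (system crontab syntax; the path to marsadm may differ on your install, and it was originally */10):

    # /etc/cron.d/mars - run MARS housekeeping every minute
    * * * * *  root  /usr/sbin/marsadm cron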
