I'm running MARS (Multiversion Asynchronous Replicated Storage) between 2 remote locations, and for the most part it is running acceptably for my needs.
As part of monitoring I periodically issue the command:
marsadm view maildata
Initially this came back in a few seconds, but after a couple of months it now takes close to 5 minutes to come back and tell me everything is OK. The delay has been getting gradually longer, and it appears to be a function of the time since the replication was set up rather than anything to do with load or time of day. The problem occurs on both the primary and the secondary server.
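For anyone wanting to track the same slowdown, here is a minimal sketch of how the response time could be logged, assuming a POSIX shell. The function name elapsed_seconds is made up for illustration and is not part of MARS.

```shell
# Print how many whole seconds a command takes to finish.
# The command's own output is discarded; only the duration is printed.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example, run on either node:
#   elapsed_seconds marsadm view maildata
```

Logging this number from the monitoring job would show whether the delay really grows steadily over time.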
I tried adding --timeout=90, but this does not seem to make any difference.
The servers have a very low (on the order of 1 megabit per second) but continuous volume of data flowing across a 200ms latency connection which has more than 10 megabits per second of bandwidth available to it. The load on my servers hovers around "1" on CPUs with 8 or more cores.
/mars shows 1% space utilization on both sides.
The volume I am mirroring is less than 1 TB in size, and sits on SSDs.
I manually ran marsadm cron on both servers and then reissued the command. It made no discernible difference.
The status of my primary looks like:
LocalDevice /dev/mars/maildata [Opened, 4 IOPS]
maildata [2/2] UpToDate Replicating DCASFR Primary marsvmserver1
And similarly, on my secondary:
maildata [2/2] UpToDate Replaying dCASFR Secondary marsvmserver1
I don't really know where to look, but I have not found any log entries indicating any kind of problem.
I compiled a kernel from source based on 5.4.20, and can see that the mars module is loaded.
How can I modify things so that I can get the marsadm status information in a few seconds (or at least under 2 minutes)?
Update
(Still don't know what's happening)
/mars/resource-maildata has a large number of empty symlink files in the format:
version-000005492-marsvmserver2 -> b2ae552972debb9835fe7510ef059430,log-000005492-marsvmserver1,23019520:247d0e2c5721f303cc46da3ba7ab51e5,log-000005491-marsvmserver1,18510436
After taking adequate precautions with my data I attempted to remove these files, but they immediately reappeared. (They also existed on the secondary).
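A rough sketch for counting these symlinks, so the number can be compared before and after any change. It assumes GNU find (for -maxdepth); the function name count_version_links is hypothetical. It only reads the directory and changes nothing.

```shell
# Count the version-* symlinks directly inside a MARS resource directory.
count_version_links() {
    find "$1" -maxdepth 1 -type l -name 'version-*' | wc -l | tr -d ' '
}

# Example:
#   count_version_links /mars/resource-maildata
```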
Interestingly, I ran a "marsadm invalidate" on the secondary and this appears to be doing something: revalidating looks like it is going to take several hours. The number of symlink files has dropped by about half but is gradually increasing. The time to do a marsadm view maildata has also about halved.
Update 2 (very partial fix)
No doubt this is very wrong and dangerous to my data. I'm not entirely happy with this, but it is progress.
I'd be grateful if anyone has a correct/better solution to my problem.
After the data resynced I was still left with over 5000 version files. It appears that the version files from the secondary node were gone, but the ones from the primary node were retained.
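To see which node the remaining version files belong to, something like the following could be used. It is a sketch only: it assumes the naming pattern version-NUMBER-HOSTNAME seen in the example above and GNU find, and the function name version_links_per_host is hypothetical.

```shell
# Print "hostname count" for the version-* symlinks in a resource directory,
# taking the hostname to be everything after "version-<digits>-".
version_links_per_host() {
    find "$1" -maxdepth 1 -type l -name 'version-*' |
        sed 's|.*/version-[0-9]*-||' |
        sort | uniq -c | awk '{print $2, $1}'
}

# Example:
#   version_links_per_host /mars/resource-maildata
```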
I played around with marsadm log-purge-all but this did not do anything.
In desperation I tried the following, which has greatly sped up the response (to a couple of seconds), but I do not know what the side effects are:
I ensured that the disks were pretty much in-sync.
On the SECONDARY I issued "rmmod mars" and then proceeded to delete the version files older than a few days. They were not automatically recreated. I then deleted the same files on the primary, and then issued a "modprobe mars" on the secondary. After a few seconds of saying it was outdated and sync'ing on the secondary it became the primary.
I then wrote more data on my primary, created an LVM snapshot of the disk underlying the MARS replication on the secondary, mounted that snapshot and saw that the new data I wrote on the primary appeared on the secondary. (I then unmounted and destroyed the snapshot)
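For the deletion step above, the candidate files could be listed first and reviewed before anything is removed. This is a sketch under assumptions: GNU find, a hypothetical function name old_version_links, and an age threshold chosen by whoever runs it. It deliberately deletes nothing, and says nothing about whether removing these files is safe for MARS.

```shell
# List version-* symlinks whose own modification time is more than
# DAYS full days in the past. find checks the symlink itself here,
# not its target. Nothing is deleted.
old_version_links() {
    dir=$1
    days=$2
    find "$dir" -maxdepth 1 -type l -name 'version-*' -mtime "+$days" | sort
}

# Example:
#   old_version_links /mars/resource-maildata 3
```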
I note that new version files were being added every 10 minutes. I checked, and I was running "marsadm cron" every 10 minutes. I changed this to run every 1 minute, and sure enough the rate at which these files were created increased to 1 per minute, so the problem likely relates to the marsadm cron job.
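The creation rate can be checked by counting how many version files appeared in a recent window. A minimal sketch, assuming GNU find (for -mmin); the function name recent_version_links is hypothetical.

```shell
# Count version-* symlinks modified within the last MINUTES minutes.
recent_version_links() {
    dir=$1
    minutes=$2
    find "$dir" -maxdepth 1 -type l -name 'version-*' -mmin "-$minutes" |
        wc -l | tr -d ' '
}

# Example: how many appeared in the last hour?
#   recent_version_links /mars/resource-maildata 60
```

If the cron interval is the cause, the count for a one-hour window should roughly match the number of "marsadm cron" runs in that hour.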