TL;DR: I need help setting up fencing on my pacemaker cluster.
I have a cluster of three machines running pacemaker. This is in my homelab, not a work setup.
Two are physical Dell servers: an R720xd and an R710. The third is a VM running on a beige box under libvirt/QEMU. All three machines are running Ubuntu Server 22.04.
These three machines also serve as members of an innodb cluster. The VM kept rebooting, corrupting the mysql database in ways that innodb cluster could not recover from on its own.
There are three pacemaker resources -- a VIP for haproxy, haproxy itself, and a VIP for mysqlrouter fronting the innodb cluster. The two haproxy resources are co-located and configured to prefer the R720xd; the mysqlrouter VIP is configured to prefer the R710. All three resources have a score of -1000 for the VM -- the VM's role is to serve as the tie-breaker vote in both clusters, not to handle connections.
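For reference, the layout is roughly equivalent to this (pcs syntax; the IPs, node names, and exact resource agents are placeholders rather than my literal config):

    # haproxy VIP + haproxy, colocated, preferring the R720xd
    pcs resource create haproxy-vip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24 op monitor interval=10s
    pcs resource create haproxy systemd:haproxy op monitor interval=10s
    pcs constraint colocation add haproxy with haproxy-vip INFINITY
    pcs constraint location haproxy-vip prefers r720xd=100

    # mysqlrouter VIP fronting the innodb cluster, preferring the R710
    pcs resource create mysqlrouter-vip ocf:heartbeat:IPaddr2 ip=192.0.2.11 cidr_netmask=24 op monitor interval=10s
    pcs constraint location mysqlrouter-vip prefers r710=100

    # -1000 location score on the VM for every resource, so it only ever acts as the tie-breaker
    pcs constraint location haproxy-vip avoids vm-quorum=1000
    pcs constraint location haproxy avoids vm-quorum=1000
    pcs constraint location mysqlrouter-vip avoids vm-quorum=1000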
What was causing the instability was sbd. I stumbled across it while checking packages, installed it, and configured it with a software watchdog using the softdog kernel module. With sbd running and stonith re-enabled, the cluster warning about having no fencing went away.
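For anyone following along, the sbd setup was roughly this -- diskless/watchdog-only mode with no shared disk (timeouts are examples, and I'm using pcs syntax for the cluster properties):

    # load the software watchdog and make it persistent across reboots
    modprobe softdog
    echo softdog > /etc/modules-load.d/softdog.conf

    # /etc/default/sbd -- no SBD_DEVICE set, so sbd runs in watchdog-only mode
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5

    systemctl enable sbd    # it starts and stops along with the cluster stack

    # tell pacemaker to rely on the watchdog and turn stonith back on
    pcs property set stonith-watchdog-timeout=10s
    pcs property set stonith-enabled=true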
On the VM, the watchdog daemon kept failing its ping check against the default gateway, which is a router running DD-WRT. I changed the ping target to a Cisco layer-3 switch on the same network and increased the ping count from 2 to 10. It still kept missing responses, so either watchdog or sbd would hard-reset the VM.
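The relevant bits of /etc/watchdog.conf on the VM looked something like this (the target IP is a placeholder, and double-check the option names against watchdog.conf(5) for your version):

    # ping check, pointed at the Cisco L3 switch instead of the DD-WRT gateway
    ping = 192.0.2.1
    ping-count = 10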
So I removed the ping check. But it STILL had a problem -- I had also set it up to watch /var/log/syslog with a last-updated interval of 900 seconds, and it kept deciding that the logfile was not changing fast enough, so the resets kept happening.
Then I changed it to /var/log/auth.log, which changes a LOT thanks to part of pacemaker (crm_mon). The resets still kept happening, and every one of them breaks HA tolerance on the innodb cluster.
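That check was just the file/change pair in /etc/watchdog.conf, something like:

    # hard-reset if the file hasn't been modified within 900 seconds
    file = /var/log/auth.log
    change = 900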
So at this point I have removed sbd from all 3 servers, which leaves me without any kind of stonith.
I have idrac on the Dell servers and tried to use the idrac fencing agent for them, but I couldn't get it working. I also tried the ssh fencing agent on all 3, and couldn't get that working either.
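The idrac attempt looked roughly like this (fence_ipmilan pointed at the idrac interfaces; IPs and credentials are placeholders, and parameter names vary a bit between fence-agents versions):

    pcs stonith create fence-r710 fence_ipmilan ip=10.0.0.50 username=root password=secret lanplus=1 pcmk_host_list=r710 op monitor interval=60s
    pcs stonith create fence-r720xd fence_ipmilan ip=10.0.0.51 username=root password=secret lanplus=1 pcmk_host_list=r720xd op monitor interval=60s
    # don't let a node run the device that is supposed to fence it
    pcs constraint location fence-r710 avoids r710
    pcs constraint location fence-r720xd avoids r720xd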
Even if I could get the idrac fencing agent working, it would hard-kill the Dell servers, causing the same innodb cluster corruption there. So I think my best option is probably to set up the ssh fencing agent on all 3 servers. I have read all the warnings that it's not designed for production, and that if a server really gets into a bad state it can't actually fence it properly.
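What I was attempting looked roughly like this, assuming "the ssh fencing agent" means external/ssh from the cluster-glue package (crm syntax here, since I'm not sure pcs exposes the legacy stonith-class agents; node names are mine):

    apt install cluster-glue
    # one ssh fencing resource per node, each kept off the node it fences
    crm configure primitive fence-ssh-r710 stonith:external/ssh params hostlist="r710" op monitor interval=60s
    crm configure primitive fence-ssh-r720xd stonith:external/ssh params hostlist="r720xd" op monitor interval=60s
    crm configure primitive fence-ssh-vm stonith:external/ssh params hostlist="vm-quorum" op monitor interval=60s
    crm configure location fence-ssh-r710-placement fence-ssh-r710 -inf: r710
    crm configure location fence-ssh-r720xd-placement fence-ssh-r720xd -inf: r720xd
    crm configure location fence-ssh-vm-placement fence-ssh-vm -inf: vm-quorum
    crm configure property stonith-enabled=true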
I do have a scripted procedure for fixing the innodb cluster after the problem happens, but recovery is resource-intensive, particularly on the VM. It takes a while to copy 17GB of mysql data.
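For the curious, the recovery boils down to roughly the standard mysqlsh AdminAPI steps -- drop the broken member and re-add it with clone recovery, which is where the 17GB copy happens (addresses are placeholders):

    # run from one of the healthy members
    mysqlsh root@r710:3306

    // then, inside the mysqlsh JS prompt
    var cluster = dba.getCluster()
    cluster.status()
    cluster.removeInstance('root@vm-quorum:3306', {force: true})
    cluster.addInstance('root@vm-quorum:3306', {recoveryMethod: 'clone'})
    cluster.status()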
I have all the servers monitored with zabbix, including a pacemaker template for the 3 servers mentioned. The two Dell servers also run a script that emails me whenever the innodb cluster reports any status other than "OK", so I have a stack of emails saying OK_NO_TOLERANCE_PARTIAL from every time the VM got hard-reset. I keep a close eye on both of those during the course of each day, so I'm going to know if the clusters have a problem.
TL;DR:
If someone can explain how to set up the ssh fencing agent so it works properly, I would appreciate it. The three servers all have ssh keys set up for passwordless ssh as root. There is no hardware fencing device other than idrac, but any fencing that hard-kills a server is going to cause the same kind of issues that a bad software state would ... and I have monitoring in place to alert me when there is a problem that needs manual repair.