Due to Ceph's poor performance on NVMe, I'm experimenting with OCFS2 on DRBD again.
DRBD appears to have initially been built around the idea of notifying an application of a hardware failure and having the application take the appropriate steps to move access to a new primary. That is not a useful pattern for our case.
DRBD 9 documents a quorum-based approach:
https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-configuring-quorum
If I understand correctly, loss of quorum (i.e. the current node ends up in the minority partition) results in I/O being frozen until quorum is re-established. This is exciting news, as we have good experience with Ceph recovering this way without intervention.
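For reference, this is roughly what I understand the quorum configuration to look like; the resource name, device paths, node names and addresses below are placeholders, and the option values are just my reading of the guide:

    resource r0 {
      options {
        quorum majority;          # a majority of nodes is needed for the resource to allow I/O
        on-no-quorum suspend-io;  # freeze I/O on quorum loss instead of returning errors
      }
      device    /dev/drbd0;
      disk      /dev/nvme0n1p1;
      meta-disk internal;
      # quorum needs at least three voters (or two plus a diskless tiebreaker)
      on node-a { node-id 0; address 10.0.0.1:7789; }
      on node-b { node-id 1; address 10.0.0.2:7789; }
      on node-c { node-id 2; address 10.0.0.3:7789; }
      connection-mesh { hosts node-a node-b node-c; }
    }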
Sadly, the DRBD manual then says in section 12 (the OCFS2 chapter)
https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#ch-ocfs2
"All cluster file systems require fencing – not only through the DRBD resource, but STONITH! A faulty member must be killed."
which seems to contradict what the quorum documentation says earlier.
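To make sure I'm reading "fencing through the DRBD resource" correctly, I assume it refers to something like the following; this is only a sketch, I believe the fencing policy sits in the net section in DRBD 9 (it was a disk option in 8.4), and the handler paths may differ per distribution:

    resource r0 {
      net {
        fencing resource-and-stonith;  # suspend I/O and call fence-peer; resume once the peer is confirmed fenced
      }
      handlers {
        # Pacemaker helpers shipped with drbd-utils; the actual node kill (STONITH)
        # would be configured separately in the cluster manager.
        fence-peer   "/usr/lib/drbd/crm-fence-peer.9.sh";
        unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
      }
    }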
I'm not sure why fencing would be required at all if quorum is already supposed to prevent the minority partition from executing write I/O. Is this because quorum does not actually work with dual-primary?
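Concretely, what I would like to run is dual-primary with quorum and no STONITH, roughly like this (again just a sketch: option names are taken from the guide, and the split-brain policies are my guess at sane defaults, not a recommendation):

    resource r0 {
      options {
        quorum majority;
        on-no-quorum suspend-io;
      }
      net {
        allow-two-primaries yes;            # needed to mount OCFS2 on both nodes at once
        after-sb-0pri discard-zero-changes; # split-brain recovery policies as in the OCFS2 chapter
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
      }
    }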