I am trying to understand the recovery process of a promotable
resource after "pcs cluster stop --all" followed by a shutdown of both
nodes. The setup is a two-node cluster plus a qdevice for quorum, with
a DRBD resource. Here is a summary of the resources before my test;
everything is working fine and server2 is the DRBD master.
* fence-server1 (stonith:fence_vmware_rest): Started server2
* fence-server2 (stonith:fence_vmware_rest): Started server1
* Clone Set: DRBDData-clone [DRBDData] (promotable):
* Masters: [ server2 ]
* Slaves: [ server1 ]
* Resource Group: nfs:
* drbd_fs (ocf::heartbeat:Filesystem): Started server2
Then I issue "pcs cluster stop --all", and the cluster stops on both
nodes as expected.
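(For reference, before stopping the cluster I would expect the DRBD
disk states to be clean on both nodes; something along these lines
should show UpToDate/UpToDate -- "drbd0" is just a placeholder for the
actual DRBD resource name:
# drbdadm dstate drbd0
UpToDate/UpToDate
)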
Now I restart server1 (previously the slave) and power off server2
(previously the master). When server1 comes back up it fences server2,
and I can see server2 starting in vCenter, but I press a key at the
GRUB menu so that server2 does not actually boot; it just stays
"paused" at the GRUB screen.
SSH'ing into server1 and running "pcs status", I get:
Cluster name: cluster1
Cluster Summary:
* Stack: corosync
* Current DC: server1 (version 2.1.0-8.el8-7c3f660707) - partition with quorum
* Last updated: Mon May 2 09:52:03 2022
* Last change: Mon May 2 09:39:22 2022 by root via cibadmin on server1
* 2 nodes configured
* 11 resource instances configured
Node List:
* Online: [ server1 ]
* OFFLINE: [ server2 ]
Full List of Resources:
* fence-server1 (stonith:fence_vmware_rest): Stopped
* fence-server2 (stonith:fence_vmware_rest): Started server1
* Clone Set: DRBDData-clone [DRBDData] (promotable):
* Slaves: [ server1 ]
* Stopped: [ server2 ]
* Resource Group: nfs:
* drbd_fs (ocf::heartbeat:Filesystem): Stopped
Here are the constraints:
# pcs constraint
Location Constraints:
Resource: fence-server1
Disabled on:
Node: server1 (score:-INFINITY)
Resource: fence-server2
Disabled on:
Node: server2 (score:-INFINITY)
Ordering Constraints:
promote DRBDData-clone then start nfs (kind:Mandatory)
Colocation Constraints:
nfs with DRBDData-clone (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
Ticket Constraints:
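(For completeness, these constraints were created with pcs commands
along these lines -- paraphrased from memory, not the exact shell
history:
# pcs constraint order promote DRBDData-clone then start nfs
# pcs constraint colocation add nfs with master DRBDData-clone INFINITY
)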
# sudo crm_mon -1A
...
Node Attributes:
* Node: server2:
* master-DRBDData : 10000
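As a sanity check, the transient promotion score can also be queried
directly; on server1 it appears to be unset. (Command shown for
illustration -- as far as I understand, the DRBD agent sets this as a
transient node attribute, hence --lifetime reboot:)
# crm_attribute --node server1 --name master-DRBDData --lifetime reboot --query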
So I can see there is quorum, but server1 is never promoted to DRBD
master, and therefore the remaining resources stay stopped until
server2 is back.
- What do I need to do to force the promotion and recover without
restarting server2?
- Why is it that if, instead of rebooting server1 and powering off
server2, I reboot server2 and power off server1, the cluster can
recover by itself?
- Does that mean that the DRBD data somehow got out of sync during
"pcs cluster stop --all"?
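For the first question, the only workaround I can think of is forcing
the promotion at the DRBD level, outside of Pacemaker, roughly like
this (untested, and "drbd0" is again a placeholder for the actual
resource name):
# drbdadm status drbd0            # check connection and disk state first
# drbdadm primary --force drbd0   # force primary even without the peer
but I would rather understand the proper cluster-level way to recover.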