Score:4

Pacemaker cluster does not cleanly fail over DRBD resource (but manual failover works)


I had to upgrade a cluster from Ubuntu 16.04. It worked fine on 18.04 and 20.04, but on 22.04 it no longer fails over the DRBD device. Putting the resource into maintenance mode and performing a manual drbdadm secondary/primary works instantly and without issue. However, when I put one node into standby, the resource fails and is fenced.

This happens on Ubuntu 22.04.2 LTS with Pacemaker 2.1.2 and Corosync 3.1.16. The DRBD kernel module is version 8.4.11 and drbd-utils is version 9.15.0. The configuration files for DRBD and Corosync do not contain anything unusual. I use crmsh to administer the cluster.
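(For reference, these version numbers can be read off with, for example, the following commands.)

lsb_release -d
pacemakerd --version
corosync -v
cat /proc/drbd        # DRBD kernel module version
drbdadm --version     # drbd-utils version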

I can strip the situation down to a two-node setup with the following relevant configuration.

node 103: server103
node 104: server104
primitive res_2-1_drbd_OC ocf:linbit:drbd \
        params drbd_resource=OC \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave \
        op start timeout=240s interval=0 \
        op promote timeout=90s interval=0 \
        op demote timeout=90s interval=0 \
        op notify timeout=90s interval=0 \
        op stop timeout=100s interval=0
ms ms_2-1_drbd_OC res_2-1_drbd_OC \
        meta master-max=1 master-node-max=1 target-role=Master clone-max=2 clone-node-max=1 notify=true
location loc_ms_2-1_drbd_OC_server103 ms_2-1_drbd_OC 0: server103
location loc_ms_2-1_drbd_OC_server104pref ms_2-1_drbd_OC 1: server104

The cluster does get into a good state:

  * Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
    * Promoted: [ server104 ]
    * Unpromoted: [ server103 ]

After resource maintenance ms_2-1_drbd_OC on, I can manually issue drbdadm secondary OC on server104 and drbdadm primary OC on server103 without a problem, and it instantly reverts to the previous state upon resource maintenance ms_2-1_drbd_OC off, leaving only the expected status messages (which have to be cleaned up):

Failed Resource Actions:
  * res_2-1_drbd_OC 31s-interval monitor on server103 returned 'promoted' at Tue Jun 20 17:40:36 2023 after 79ms
  * res_2-1_drbd_OC 29s-interval monitor on server104 returned 'ok' at Tue Jun 20 17:40:36 2023 after 49ms
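For completeness, the manual switch-over described above is roughly the following sequence (crmsh plus drbdadm; the drbdadm commands are run on the node indicated):

crm resource maintenance ms_2-1_drbd_OC on
drbdadm secondary OC                       # on server104
drbdadm primary OC                         # on server103
crm resource maintenance ms_2-1_drbd_OC off
crm resource cleanup res_2-1_drbd_OC       # clears the monitor messages shown above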

Changing the location constraint score leads to an immediate (!) switchover in the cluster. This works both ways:

location loc_ms_2-1_drbd_OC_server103 ms_2-1_drbd_OC 10: server103
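One way to make such a change with crmsh is, for example:

crm configure edit loc_ms_2-1_drbd_OC_server103   # bump the score, e.g. 0 -> 10
crm configure show loc_ms_2-1_drbd_OC_server103   # verify the result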

So far, so good -- and I would expect everything to work just fine. However, a forced failover does not succeed. Starting from the good state above, I issue node standby server104; the effects are described step by step below, with excerpts from journalctl -fxb (pacemaker-controld OK messages omitted).
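In full crmsh form, the test is roughly this (including the command to bring the node back online afterwards):

crm node standby server104
# ... and once the experiment is over:
crm node online server104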

1.) Pacemaker tries to promote on server103:

  * Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
    * res_2-1_drbd_OC   (ocf:linbit:drbd):       Promoting server103
    * Stopped: [ server104 server105 ]

journalctl on server104 (first second):

kernel: block drbd1: role( Primary -> Secondary ) 
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: drbd OC: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
kernel: drbd OC: ack_receiver terminated
kernel: drbd OC: Terminating drbd_a_OC
kernel: drbd OC: Connection closed
kernel: drbd OC: conn( Disconnecting -> StandAlone ) 
kernel: drbd OC: receiver terminated
kernel: drbd OC: Terminating drbd_r_OC
kernel: block drbd1: disk( UpToDate -> Failed ) 
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: block drbd1: disk( Failed -> Diskless ) 
kernel: drbd OC: Terminating drbd_w_OC
pacemaker-attrd[1336]:  notice: Setting master-res_2-1_drbd_OC[server104]: 10000 -> (unset)

journalctl on server103 (first second, pacemaker-controld OK messages omitted):

kernel: block drbd1: peer( Primary -> Secondary ) 
kernel: drbd OC: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
kernel: drbd OC: ack_receiver terminated
kernel: drbd OC: Terminating drbd_a_OC
kernel: drbd OC: Connection closed
kernel: drbd OC: conn( TearDown -> Unconnected ) 
kernel: drbd OC: receiver terminated
kernel: drbd OC: Restarting receiver thread
kernel: drbd OC: receiver (re)started
kernel: drbd OC: conn( Unconnected -> WFConnection ) 
pacemaker-attrd[1596]:  notice: Setting master-res_2-1_drbd_OC[server104]: 10000 -> (unset)
crm-fence-peer.sh (...)

2.) After the promote timeout of 90 seconds, Pacemaker fails the resource:

  * Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
    * res_2-1_drbd_OC   (ocf:linbit:drbd):       FAILED server103
    * Stopped: [ server104 server105 ]
Failed Resource Actions:
  * res_2-1_drbd_OC promote on server103 could not be executed (Timed Out) because 'Process did not exit within specified timeout'

journalctl on server104 (90th second):

pacemaker-attrd[1336]:  notice: Setting fail-count-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1
pacemaker-attrd[1336]:  notice: Setting last-failure-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1687276647
pacemaker-attrd[1336]:  notice: Setting master-res_2-1_drbd_OC[server103]: 1000 -> (unset)
pacemaker-attrd[1336]:  notice: Setting master-res_2-1_drbd_OC[server103]: (unset) -> 10000

journalctl on server103 (90th through 93rd second):

pacemaker-execd[1595]:  warning: res_2-1_drbd_OC_promote_0[85862] timed out after 90000ms
pacemaker-controld[1598]:  error: Result of promote operation for res_2-1_drbd_OC on server103: Timed Out after 1m30s (Process did not exit within specified timeout)
pacemaker-attrd[1596]:  notice: Setting fail-count-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1
pacemaker-attrd[1596]:  notice: Setting last-failure-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1687284510
crm-fence-peer.sh[85893]: INFO peer is not reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-OC-ms_2-1_drbd_OC'
kernel: drbd OC: helper command: /sbin/drbdadm fence-peer OC exit code 5 (0x500)
kernel: drbd OC: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
kernel: drbd OC: pdsk( DUnknown -> Outdated ) 
kernel: block drbd1: role( Secondary -> Primary ) 
kernel: block drbd1: new current UUID #1:2:3:4#
kernel: block drbd1: role( Primary -> Secondary ) 
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: drbd OC: conn( WFConnection -> Disconnecting ) 
kernel: drbd OC: Discarding network configuration.
kernel: drbd OC: Connection closed
kernel: drbd OC: conn( Disconnecting -> StandAlone ) 
kernel: drbd OC: receiver terminated
kernel: drbd OC: Terminating drbd_r_OC
kernel: block drbd1: disk( UpToDate -> Failed ) 
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: block drbd1: disk( Failed -> Diskless ) 
kernel: drbd OC: Terminating drbd_w_OC
pacemaker-attrd[1596]:  notice: Setting master-res_2-1_drbd_OC[server103]: 1000 -> (unset)
pacemaker-controld[1598]:  notice: Result of stop operation for res_2-1_drbd_OC on server103: ok
pacemaker-controld[1598]:  notice: Requesting local execution of start operation for res_2-1_drbd_OC on server103
systemd-udevd[86992]: drbd1: Process '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/drbd1' failed with exit code 1.
kernel: drbd OC: Starting worker thread (from drbdsetup-84 [87112])
kernel: block drbd1: disk( Diskless -> Attaching ) 
kernel: drbd OC: Method to ensure write ordering: flush
kernel: block drbd1: max BIO size = 1048576
kernel: block drbd1: drbd_bm_resize called with capacity == 3906131464
kernel: block drbd1: resync bitmap: bits=488266433 words=7629164 pages=14901
kernel: drbd1: detected capacity change from 0 to 3906131464
kernel: block drbd1: size = 1863 GB (1953065732 KB)
kernel: block drbd1: recounting of set bits took additional 3 jiffies
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: block drbd1: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated ) 
kernel: block drbd1: attached to UUIDs #1:2:3:4#
kernel: drbd OC: conn( StandAlone -> Unconnected ) 
kernel: drbd OC: Starting receiver thread (from drbd_w_OC [87114])
kernel: drbd OC: receiver (re)started
kernel: drbd OC: conn( Unconnected -> WFConnection ) 
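(Side note: with the 8.4 kernel module, these state transitions can also be followed directly on the DRBD side, independently of Pacemaker, for example with:)

watch -n1 cat /proc/drbd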

3.) Surprisingly, Pacemaker recovers and does promote the resource:

journalctl on server104 (113th second):

pacemaker-attrd[1336]:  notice: Setting master-res_2-1_drbd_OC[server103]: (unset) -> 10000

journalctl on server103 (98th through 114th second):

drbd(res_2-1_drbd_OC)[87174]: INFO: OC: Called drbdsetup wait-connect /dev/drbd_OC --wfc-timeout=5 --degr-wfc-timeout=5 --outdated-wfc-timeout=5
drbd(res_2-1_drbd_OC)[87178]: INFO: OC: Exit code 5
drbd(res_2-1_drbd_OC)[87182]: INFO: OC: Command output:
drbd(res_2-1_drbd_OC)[87186]: INFO: OC: Command stderr:
drbd(res_2-1_drbd_OC)[87217]: INFO: OC: Called drbdsetup wait-connect /dev/drbd_NC --wfc-timeout=5 --degr-wfc-timeout=5 --outdated-wfc-timeout=5
drbd(res_2-1_drbd_OC)[87221]: INFO: OC: Exit code 5
drbd(res_2-1_drbd_OC)[87225]: INFO: OC: Command output:
drbd(res_2-1_drbd_OC)[87229]: INFO: OC: Command stderr:
pacemaker-attrd[1596]:  notice: Setting master-res_2-1_drbd_OC[server103]: (unset) -> 10000
kernel: block drbd1: role( Secondary -> Primary ) 
kernel: block drbd3: role( Secondary -> Primary ) 
kernel: block drbd3: new current UUID #1:2:3:4#

In the end, the resource did fail over, but an error is left behind:

  * Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
    * Promoted: [ server103 ]
    * Stopped: [ server104 server105 ]
Failed Resource Actions:
  * res_2-1_drbd_OC promote on server103 could not be executed (Timed Out) because 'Process did not exit within specified timeout' at Tue Jun 20 20:07:00 2023 after 1m30.001s

Furthermore, a fencing location constraint, restricting the Master role to server103, is added to the configuration:

location drbd-fence-by-handler-OC-ms_2-1_drbd_OC ms_2-1_drbd_OC rule $role=Master -inf: #uname ne server103
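That constraint is placed by DRBD's crm-fence-peer.sh handler (visible in the journal excerpts above). I am not reproducing my drbd.conf here, but with an 8.4 module the fencing hooks are typically wired up along these lines (a sketch, not a copy of my configuration):

resource OC {
  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  # device/disk/address definitions omitted
}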

Summary and things I have tried

The resource does fail over automatically, but it has to wait through a seemingly unnecessary timeout, and it leaves behind an error that has to be cleared manually.
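For reference, clearing this up manually amounts to something like the following; as far as I understand it, the fence constraint is normally removed by the crm-unfence-peer.sh handler after a resync, but it can also be deleted by hand:

crm resource cleanup res_2-1_drbd_OC                            # clears the failed promote action
crm configure delete drbd-fence-by-handler-OC-ms_2-1_drbd_OC    # removes the fence constraint if it lingers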

Decreasing the demote timeout of the DRBD resource to 30 seconds made the whole process fail without self-recovery. That does not make sense to me, since the manual switch-over happens instantly. It looks as if the promote operation did not put the resource into the secondary role before switching it to primary. Yet I seem to be the only one experiencing this strange behavior.

I have dug through all the information I could find, going back to Heartbeat and various versions of Corosync, Pacemaker and DRBD. While upgrading the systems there were major problems with network connectivity, and I may well have missed something crucial while hopping over three LTS versions. Furthermore, I am not deeply acquainted with the HA technologies.

I would be very thankful for pointers as to which direction I should look in. Sorry for the extremely long post! I hope you can skim it and find the relevant information.

Score:0

OK, this was very, very odd! And that explains why I was the only one experiencing such an issue. Thanks to everyone who had a look at this problem!

Answering my own question: it must have been due to some obscure network problem. As mentioned, I had a rather intricate network setup with one big server (103) and two smaller ones (104 and 105 -- the latter not mentioned above for the sake of simplicity). Each of the smaller servers is connected to server103 via two back-to-back cables, and communication runs over a bond of those two links in balanced round-robin (balance-rr) mode. This all appeared to work, and I cannot explain what really made the difference.
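To give an idea of what that looks like in Netplan terms, here is a sketch (interface names and the address are made up, not copied from my files):

network:
  version: 2
  ethernets:
    eno2: {}
    eno3: {}
  bonds:
    bond0:
      interfaces: [eno2, eno3]
      parameters:
        mode: balance-rr
      addresses: [192.0.2.103/24]   # example address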

The only thing I did to solve the problem was (re-?)applying the Netplan configuration on server103 and rebooting. This is extremely odd, since that happens during every boot anyway. Nevertheless, it did the trick, and failovers now happen almost instantly. It was a complete shot in the dark and sheer luck that it hit the target: I was looking for any kind of asymmetry and somehow suspected a communication issue. The transition from ifupdown to Netplan had not been very smooth in this upgrade across three LTS versions.
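"(Re-)applying" here means nothing more exotic than roughly:

sudo netplan apply
sudo reboot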

Afterwards, I also changed the Netplan configuration files on server10{4,5}, where each of the interfaces ("ethernets") had been configured with "activation-mode: manual". I switched these to empty configurations ("{}"). Now the failovers happen absolutely smoothly, and this persists across reboots. After losing two full working days to this spooky issue, it makes me extremely happy to put nodes into and out of standby just for fun.
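For clarity, the per-interface change was roughly the following (again with made-up interface names):

# before (on server104/105):
ethernets:
  eno2:
    activation-mode: manual
  eno3:
    activation-mode: manual

# after:
ethernets:
  eno2: {}
  eno3: {}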
