Score:4

Pacemaker cluster does not cleanly fail over DRBD resource (but manual failover works)


I had to upgrade a cluster from Ubuntu 16.04. It worked fine on 18.04 and 20.04, but on 22.04 it no longer fails over the DRBD device. Putting the resource into maintenance mode and performing a manual drbdadm secondary/primary works instantly and without issue. However, when I put one node into standby, the resource fails and is fenced.

This happens on Ubuntu 22.04.2 LTS with Pacemaker 2.1.2 and Corosync 3.1.16. The DRBD kernel module is version 8.4.11 and drbd-utils is version 9.15.0. The configuration files for DRBD and Corosync do not contain anything unusual. I use crmsh to administer the cluster.
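(For reference, these version numbers can be read off with, for example, the following commands.)

lsb_release -d
pacemakerd --version
corosync -v
cat /proc/drbd        # DRBD kernel module version
drbdadm --version     # drbd-utils version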

I can strip the situation down to a two-node setup with the following relevant configuration.

node 103: server103
node 104: server104
primitive res_2-1_drbd_OC ocf:linbit:drbd \
        params drbd_resource=OC \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave \
        op start timeout=240s interval=0 \
        op promote timeout=90s interval=0 \
        op demote timeout=90s interval=0 \
        op notify timeout=90s interval=0 \
        op stop timeout=100s interval=0
ms ms_2-1_drbd_OC res_2-1_drbd_OC \
        meta master-max=1 master-node-max=1 target-role=Master clone-max=2 clone-node-max=1 notify=true
location loc_ms_2-1_drbd_OC_server103 ms_2-1_drbd_OC 0: server103
location loc_ms_2-1_drbd_OC_server104pref ms_2-1_drbd_OC 1: server104

The cluster does get into a good state:

  * Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
    * Promoted: [ server104 ]
    * Unpromoted: [ server103 ]

After resource maintenance ms_2-1_drbd_OC on, I can manually issue drbdadm secondary OC on server104 and drbdadm primary OC on server103 without a problem, and it instantly reverts to the previous state upon resource maintenance ms_2-1_drbd_OC off, leaving only the expected status messages (which have to be cleaned up):

Failed Resource Actions:
  * res_2-1_drbd_OC 31s-interval monitor on server103 returned 'promoted' at Tue Jun 20 17:40:36 2023 after 79ms
  * res_2-1_drbd_OC 29s-interval monitor on server104 returned 'ok' at Tue Jun 20 17:40:36 2023 after 49ms
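For completeness, the manual switch-over described above is roughly the following sequence (crmsh plus drbdadm; the drbdadm commands are run on the node indicated):

crm resource maintenance ms_2-1_drbd_OC on
drbdadm secondary OC                       # on server104
drbdadm primary OC                         # on server103
crm resource maintenance ms_2-1_drbd_OC off
crm resource cleanup res_2-1_drbd_OC       # clears the monitor messages shown above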

Changing the location constraint score leads to an immediate (!) switchover in the cluster. This works both ways:

location loc_ms_2-1_drbd_OC_server103 ms_2-1_drbd_OC 10: server103
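One way to make such a change with crmsh is, for example:

crm configure edit loc_ms_2-1_drbd_OC_server103   # bump the score, e.g. 0 -> 10
crm configure show loc_ms_2-1_drbd_OC_server103   # verify the result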

So far, so good -- and I would expect everything to work just fine. However, a forced failover does not succeed. Starting from the good state above, I issue node standby server104; the effects are described step by step below, with excerpts from journalctl -fxb (pacemaker-controld OK messages omitted).
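In full crmsh form, the test is roughly this (including the command to bring the node back online afterwards):

crm node standby server104
# ... and once the experiment is over:
crm node online server104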

1.) Pacemaker tries to promote on server103:

  * Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
    * res_2-1_drbd_OC   (ocf:linbit:drbd):       Promoting server103
    * Stopped: [ server104 server105 ]

journalctl on server104 (first second):

kernel: block drbd1: role( Primary -> Secondary ) 
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: drbd OC: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
kernel: drbd OC: ack_receiver terminated
kernel: drbd OC: Terminating drbd_a_OC
kernel: drbd OC: Connection closed
kernel: drbd OC: conn( Disconnecting -> StandAlone ) 
kernel: drbd OC: receiver terminated
kernel: drbd OC: Terminating drbd_r_OC
kernel: block drbd1: disk( UpToDate -> Failed ) 
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: block drbd1: disk( Failed -> Diskless ) 
kernel: drbd OC: Terminating drbd_w_OC
pacemaker-attrd[1336]:  notice: Setting master-res_2-1_drbd_OC[server104]: 10000 -> (unset)

journalctl on server103 (first second, pacemaker-controld OK messages omitted):

kernel: block drbd1: peer( Primary -> Secondary ) 
kernel: drbd OC: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
kernel: drbd OC: ack_receiver terminated
kernel: drbd OC: Terminating drbd_a_OC
kernel: drbd OC: Connection closed
kernel: drbd OC: conn( TearDown -> Unconnected ) 
kernel: drbd OC: receiver terminated
kernel: drbd OC: Restarting receiver thread
kernel: drbd OC: receiver (re)started
kernel: drbd OC: conn( Unconnected -> WFConnection ) 
pacemaker-attrd[1596]:  notice: Setting master-res_2-1_drbd_OC[server104]: 10000 -> (unset)
crm-fence-peer.sh (...)

2.) After the promote timeout of 90 seconds, Pacemaker fails the resource:

  * Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
    * res_2-1_drbd_OC   (ocf:linbit:drbd):       FAILED server103
    * Stopped: [ server104 server105 ]
Failed Resource Actions:
  * res_2-1_drbd_OC promote on server103 could not be executed (Timed Out) because 'Process did not exit within specified timeout'

journalctl on server104 (90th second):

pacemaker-attrd[1336]:  notice: Setting fail-count-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1
pacemaker-attrd[1336]:  notice: Setting last-failure-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1687276647
pacemaker-attrd[1336]:  notice: Setting master-res_2-1_drbd_OC[server103]: 1000 -> (unset)
pacemaker-attrd[1336]:  notice: Setting master-res_2-1_drbd_OC[server103]: (unset) -> 10000

journalctl on server103 (90th through 93rd second):

pacemaker-execd[1595]:  warning: res_2-1_drbd_OC_promote_0[85862] timed out after 90000ms
pacemaker-controld[1598]:  error: Result of promote operation for res_2-1_drbd_OC on server103: Timed Out after 1m30s (Process did not exit within specified timeout)
pacemaker-attrd[1596]:  notice: Setting fail-count-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1
pacemaker-attrd[1596]:  notice: Setting last-failure-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1687284510
crm-fence-peer.sh[85893]: INFO peer is not reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-OC-ms_2-1_drbd_OC'
kernel: drbd OC: helper command: /sbin/drbdadm fence-peer OC exit code 5 (0x500)
kernel: drbd OC: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
kernel: drbd OC: pdsk( DUnknown -> Outdated ) 
kernel: block drbd1: role( Secondary -> Primary ) 
kernel: block drbd1: new current UUID #1:2:3:4#
kernel: block drbd1: role( Primary -> Secondary ) 
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: drbd OC: conn( WFConnection -> Disconnecting ) 
kernel: drbd OC: Discarding network configuration.
kernel: drbd OC: Connection closed
kernel: drbd OC: conn( Disconnecting -> StandAlone ) 
kernel: drbd OC: receiver terminated
kernel: drbd OC: Terminating drbd_r_OC
kernel: block drbd1: disk( UpToDate -> Failed ) 
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: block drbd1: disk( Failed -> Diskless ) 
kernel: drbd OC: Terminating drbd_w_OC
pacemaker-attrd[1596]:  notice: Setting master-res_2-1_drbd_OC[server103]: 1000 -> (unset)
pacemaker-controld[1598]:  notice: Result of stop operation for res_2-1_drbd_OC on server103: ok
pacemaker-controld[1598]:  notice: Requesting local execution of start operation for res_2-1_drbd_OC on server103
systemd-udevd[86992]: drbd1: Process '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/drbd1' failed with exit code 1.
kernel: drbd OC: Starting worker thread (from drbdsetup-84 [87112])
kernel: block drbd1: disk( Diskless -> Attaching ) 
kernel: drbd OC: Method to ensure write ordering: flush
kernel: block drbd1: max BIO size = 1048576
kernel: block drbd1: drbd_bm_resize called with capacity == 3906131464
kernel: block drbd1: resync bitmap: bits=488266433 words=7629164 pages=14901
kernel: drbd1: detected capacity change from 0 to 3906131464
kernel: block drbd1: size = 1863 GB (1953065732 KB)
kernel: block drbd1: recounting of set bits took additional 3 jiffies
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: block drbd1: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated ) 
kernel: block drbd1: attached to UUIDs #1:2:3:4#
kernel: drbd OC: conn( StandAlone -> Unconnected ) 
kernel: drbd OC: Starting receiver thread (from drbd_w_OC [87114])
kernel: drbd OC: receiver (re)started
kernel: drbd OC: conn( Unconnected -> WFConnection ) 
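(Side note: with the 8.4 kernel module, these state transitions can also be followed directly on the DRBD side, independently of Pacemaker, for example with:)

watch -n1 cat /proc/drbd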

3.) Surprisingly, Pacemaker recovers and does promote the resource:

journalctl on server104 (113th second):

pacemaker-attrd[1336]:  notice: Setting master-res_2-1_drbd_OC[server103]: (unset) -> 10000

journalctl on server103 (98th through 114th second):

drbd(res_2-1_drbd_OC)[87174]: INFO: OC: Called drbdsetup wait-connect /dev/drbd_OC --wfc-timeout=5 --degr-wfc-timeout=5 --outdated-wfc-timeout=5
drbd(res_2-1_drbd_OC)[87178]: INFO: OC: Exit code 5
drbd(res_2-1_drbd_OC)[87182]: INFO: OC: Command output:
drbd(res_2-1_drbd_OC)[87186]: INFO: OC: Command stderr:
drbd(res_2-1_drbd_OC)[87217]: INFO: OC: Called drbdsetup wait-connect /dev/drbd_NC --wfc-timeout=5 --degr-wfc-timeout=5 --outdated-wfc-timeout=5
drbd(res_2-1_drbd_OC)[87221]: INFO: OC: Exit code 5
drbd(res_2-1_drbd_OC)[87225]: INFO: OC: Command output:
drbd(res_2-1_drbd_OC)[87229]: INFO: OC: Command stderr:
pacemaker-attrd[1596]:  notice: Setting master-res_2-1_drbd_OC[server103]: (unset) -> 10000
kernel: block drbd1: role( Secondary -> Primary ) 
kernel: block drbd3: role( Secondary -> Primary ) 
kernel: block drbd3: new current UUID #1:2:3:4#

In the end, the resource did fail over, but an error is left behind:

  * Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
    * Promoted: [ server103 ]
    * Stopped: [ server104 server105 ]
Failed Resource Actions:
  * res_2-1_drbd_OC promote on server103 could not be executed (Timed Out) because 'Process did not exit within specified timeout' at Tue Jun 20 20:07:00 2023 after 1m30.001s

Furthermore, a fencing location constraint, restricting the Master role to server103, is added to the configuration:

location drbd-fence-by-handler-OC-ms_2-1_drbd_OC ms_2-1_drbd_OC rule $role=Master -inf: #uname ne server103
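That constraint is placed by DRBD's crm-fence-peer.sh handler (visible in the journal excerpts above). I am not reproducing my drbd.conf here, but with an 8.4 module the fencing hooks are typically wired up along these lines (a sketch, not a copy of my configuration):

resource OC {
  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  # device/disk/address definitions omitted
}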

Summary and things I have tried

The resource does fail over automatically, but it has to wait through a seemingly unnecessary timeout, and it leaves behind an error that has to be cleared manually.
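For reference, clearing this up manually amounts to something like the following; as far as I understand it, the fence constraint is normally removed by the crm-unfence-peer.sh handler after a resync, but it can also be deleted by hand:

crm resource cleanup res_2-1_drbd_OC                            # clears the failed promote action
crm configure delete drbd-fence-by-handler-OC-ms_2-1_drbd_OC    # removes the fence constraint if it lingers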

Decreasing the demote timeout of the DRBD resource to 30 seconds made the whole process fail without self-recovery. That does not make sense to me, since the manual switch-over happens instantly. It looks as if the promote operation did not put the resource into the secondary role before switching it to primary. Yet I seem to be the only one experiencing this strange behavior.

I have dug through all the information I could find, going back to Heartbeat and various versions of Corosync, Pacemaker and DRBD. While upgrading the systems there were major problems with network connectivity, and I may well have missed something crucial while hopping over three LTS versions. Furthermore, I am not deeply acquainted with the HA technologies.

I would be very thankful for pointers as to which direction I should look in. Sorry for the extremely long post! I hope you can skim it and find the relevant information.

Score:0

OK, this was very, very odd! And that explains why I was the only one experiencing such an issue. Thanks to everyone who had a look at this problem!

Answering my own question: it must have been due to some obscure network problem. As mentioned, I had a rather intricate network setup with one big server (103) and two smaller ones (104 and 105 -- the latter not mentioned above for the sake of simplicity). Each of the smaller servers is connected to server103 via two back-to-back cables, and communication runs over a bond of those two links in balanced round-robin (balance-rr) mode. This all appeared to work, and I cannot explain what really made the difference.
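To give an idea of what that looks like in Netplan terms, here is a sketch (interface names and the address are made up, not copied from my files):

network:
  version: 2
  ethernets:
    eno2: {}
    eno3: {}
  bonds:
    bond0:
      interfaces: [eno2, eno3]
      parameters:
        mode: balance-rr
      addresses: [192.0.2.103/24]   # example address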

The only thing I did to solve the problem was (re-?)applying the Netplan configuration on server103 and rebooting. This is extremely odd, since that happens during every boot anyway. Nevertheless, it did the trick, and failovers now happen almost instantly. It was a complete shot in the dark and sheer luck that it hit the target: I was looking for any kind of asymmetry and somehow suspected a communication issue. The transition from ifupdown to Netplan had not been very smooth in this upgrade across three LTS versions.
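"(Re-)applying" here means nothing more exotic than roughly:

sudo netplan apply
sudo reboot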

Afterwards, I also changed the Netplan configuration files on server10{4,5}, where each of the interfaces ("ethernets") had been configured with "activation-mode: manual". I switched these to empty configurations ("{}"). Now the failovers happen absolutely smoothly, and this persists across reboots. After losing two full working days to this spooky issue, it makes me extremely happy to put nodes into and out of standby just for fun.
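For clarity, the per-interface change was roughly the following (again with made-up interface names):

# before (on server104/105):
ethernets:
  eno2:
    activation-mode: manual
  eno3:
    activation-mode: manual

# after:
ethernets:
  eno2: {}
  eno3: {}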
