Score:1

How do I force Pacemaker to keep restarting a systemd resource if it fails (instead of putting it into the 'stopped' state)?

My goal is to implement a two-node HTTP load balancer using a virtual IP (VIP). For this task I picked Pacemaker (for virtual IP switching) and Caddy as the HTTP load balancer. The load-balancer choice is not the point of this question. :)

My requirement is simple - I want the virtual IP to be assigned to a host where a healthy, working Caddy instance is running.

Here is how I implemented it using Pacemaker:

# Disable the STONITH feature
pcs property set stonith-enabled=false

# Ignore the quorum policy
pcs property set no-quorum-policy=ignore

# Set up the virtual IP
pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=123.123.123.123

# Set up the caddy resource, using the systemd provider. By default a resource
# runs on one node at a time, so clone it; the clone runs on all nodes at the same time.
# https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/configuring_the_red_hat_high_availability_add-on_with_pacemaker/ch-advancedresource-haar
pcs resource create caddy systemd:caddy clone

# Add a colocation constraint, so the virtual IP is assigned and the
# application is running on the same node.
pcs constraint colocation add ClusterIP with caddy-clone INFINITY
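
To verify the resulting setup, the cluster state and the configured constraints can be inspected (exact output varies by pcs version):

# Show resource state and the configured constraints
pcs status
pcs constraint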

However, if I SSH to the node where the virtual IP is assigned, break the Caddy configuration file, and run systemctl restart caddy, then after some time Pacemaker detects that caddy failed to start and simply puts it into the stopped state.
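
For reference, the failure count that Pacemaker has recorded can be inspected per resource (assuming the resource name caddy from the setup above):

# Show how many times the caddy resource has failed on each node
pcs resource failcount show caddy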

How do I force Pacemaker to keep restarting my systemd resource instead of putting it into the stopped state?

On top of that - if I fix the configuration file and run systemctl restart caddy, the service starts, but Pacemaker still keeps the resource in the stopped state.

And on top of that - if I stop the other node, the virtual IP is not assigned anywhere, because of the constraint below:

# Add a colocation constraint, so the virtual IP is assigned and the
# application is running on the same node.
pcs constraint colocation add ClusterIP with caddy-clone INFINITY

Can someone point me in the right direction as to what I am doing wrong?

Gerald Schneider: It should be the job of systemd to restart a failed service.

Nikita Kipriyanov: @GeraldSchneider No. Once we are talking about a clustered resource, the cluster resource manager must take over this job - Pacemaker, Veritas Cluster, whatever. systemd (or any other local init system) should continue to control only local resources (including running the CRM itself).
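
As a side note on that division of labor (an assumption, not part of the original exchange): when Pacemaker manages a systemd unit, the unit is usually left disabled in systemd, so that only the cluster starts and stops it:

# On each node: leave the unit disabled at boot; Pacemaker starts and stops it
systemctl disable caddy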
Score:2

In Pacemaker, certain failures are considered fatal, and once they're encountered, they need to be cleaned up manually (unless you've configured node-level fencing, which would clean them up for you by fencing the failed node).
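
A minimal sketch of such a manual cleanup, assuming the resource names from the question:

# Clear the recorded failures so Pacemaker attempts to start the resource again
pcs resource cleanup caddy-clone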

You need to tell Pacemaker that start operation failures are not fatal. In clusters without fencing, I will usually also set a failure timeout, which automatically cleans up operation failures after a given number of seconds.

pcs property set start-failure-is-fatal=false
# failure-timeout is a resource meta attribute rather than a cluster property,
# so set it as a resource default:
pcs resource defaults failure-timeout=300
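
If it should apply to just one resource, the timeout can also be set as a meta attribute on that resource (a sketch, assuming the clone name from the question):

# Automatically expire recorded failures on the caddy clone after 300 seconds
pcs resource meta caddy-clone failure-timeout=300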