Trying to use keepalived for failover and forwarding but getting "Keepalived_healthcheckers[1706]: TCP socket bind failed. Rescheduling."

Question

Score:3

jstnewb

11/20/22, 12:01 PM

The goal is to get two CentOS 7 VMs running keepalived to fail over the VIP 192.168.1.11 and also forward HTTP traffic (to become HTTPS shortly after this works) to a corresponding HTTP server.

192.168.1.11 vm1 (MASTER) --> fwd http to 192.168.1.71
192.168.1.11 vm2 (BACKUP) --> fwd http to 192.168.1.72

I previously had the failover part of this working with keepalived, but with haproxy (on each VM) handling the forwarding instead. Now that I am trying to get keepalived to do the forwarding as well (the mode I'm trying to use is direct routing, I believe), I am getting socket bind errors in the status output and failover doesn't work.

Here is the vm1 keepalived.conf:

global_defs { 
    script_user root 
} 

vrrp_instance VIP01 {
    state MASTER 
    interface eth0
    virtual_router_id 101
    priority 101
    advert_int 1

    authentication {
         auth_type PASS
         auth_pass [snip]
    }

    virtual_ipaddress {
        192.168.1.11/24
    }
}

virtual_server 192.168.1.11 8080 {
    delay_loop 10
    protocol TCP
    lb_algo rr
    lb_kind DR
    persistence_timeout 7200

    real_server 192.168.1.71 8080 {
        weight 1
        TCP_CHECK {
            connect_timeout 5
            connect_port 8080
        }
    }
}

and vm2:

global_defs { 
    script_user root 
} 

vrrp_instance VIP01 {
    state BACKUP     
    interface eth0
    virtual_router_id 101    
    priority 100   
    advert_int 1

    authentication {
         auth_type PASS
         auth_pass [snip]
    }

    virtual_ipaddress {
        192.168.1.11/24
    }
}

virtual_server 192.168.1.11 80 {
    delay_loop 10
    protocol TCP
    lb_algo rr
    lb_kind DR
    persistence_timeout 7200

    real_server 192.168.1.72 8080 {
        weight 1
        TCP_CHECK {
          connect_timeout 5
          connect_port 8080
        }
    }
}
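
One detail that may be worth checking in configs like the two above: with lb_kind DR, LVS forwards packets to the real server without rewriting the destination port, so a virtual_server port that differs from its real_server port (80 versus 8080 in the vm2 block) cannot act as a port mapping. A minimal sketch of a checker, assuming the simple one-declaration-per-line layout used here; `find_dr_port_mismatches` is a hypothetical helper, not part of keepalived:

```python
import re

# Patterns for the three declarations this sketch cares about.
VS_RE = re.compile(r"^\s*virtual_server\s+(\S+)\s+(\d+)")
RS_RE = re.compile(r"^\s*real_server\s+(\S+)\s+(\d+)")
KIND_RE = re.compile(r"^\s*lb_kind\s+(\S+)")


def find_dr_port_mismatches(conf_text):
    """Return (vip, vport, real_ip, real_port) tuples whose ports differ under lb_kind DR."""
    services = []
    for line in conf_text.splitlines():
        vs = VS_RE.match(line)
        if vs:
            # A new virtual_server block starts here.
            services.append({"vip": vs.group(1), "port": int(vs.group(2)),
                             "kind": None, "reals": []})
            continue
        if not services:
            continue  # still inside global_defs / vrrp_instance
        kind = KIND_RE.match(line)
        if kind:
            services[-1]["kind"] = kind.group(1)
            continue
        rs = RS_RE.match(line)
        if rs:
            services[-1]["reals"].append((rs.group(1), int(rs.group(2))))
    return [
        (s["vip"], s["port"], rip, rport)
        for s in services if s["kind"] == "DR"
        for rip, rport in s["reals"] if rport != s["port"]
    ]
```

Called with the text of /etc/keepalived/keepalived.conf, it returns an empty list when every DR real server uses the same port as its virtual server.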

The output from systemctl status keepalived (on both VMs):

...
Jul 20 07:52:16 [hostname] Keepalived_healthcheckers[1738]: TCP socket bind failed. Rescheduling.
Jul 20 07:52:26 [hostname] Keepalived_healthcheckers[1738]: TCP socket bind failed. Rescheduling.
Jul 20 07:52:36 [hostname] Keepalived_healthcheckers[1738]: TCP socket bind failed. Rescheduling.
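
The timestamps above are 10 seconds apart, which matches delay_loop 10, so the check appears to fail on every run rather than intermittently. A small sketch for measuring that spacing from syslog-style lines; `gaps_in_seconds` is a hypothetical helper, and it assumes the 15-character `Mon DD HH:MM:SS` prefix shown above, entries from the same year, and an English locale for month names:

```python
from datetime import datetime


def gaps_in_seconds(log_lines):
    """Return the number of seconds between consecutive syslog-style lines."""
    # Syslog omits the year, so a fixed placeholder year is prepended for parsing.
    times = [
        datetime.strptime("2000 " + line[:15], "%Y %b %d %H:%M:%S")
        for line in log_lines
    ]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]
```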

I also tried adding the following to /etc/sysctl.conf:

net.ipv4.ip_forward = 1
net.ipv4.ip_nonlocal_bind = 1

and confirmed they took effect by querying them after a reboot.
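
For reference, the running values of these keys live under /proc/sys, with the dots in the key name mapped to directory separators. A minimal sketch of reading them directly; `read_sysctl` is a hypothetical helper, and the `root` parameter exists only so the function can be pointed at a different directory:

```python
from pathlib import Path


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl key such as 'net.ipv4.ip_forward'."""
    # 'net.ipv4.ip_forward' -> /proc/sys/net/ipv4/ip_forward
    return Path(root, *name.split(".")).read_text().strip()

# Example use on the VM:
#   read_sysctl("net.ipv4.ip_forward")
#   read_sysctl("net.ipv4.ip_nonlocal_bind")
```

Both calls should return "1" if the settings from /etc/sysctl.conf were applied.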

I realize that round-robin load balancing with one server in the list is not really load balancing; I just saw it as a way to do the forwarding. If there's a more concise or better way to do this, I'm interested.

Edits:

If I comment out the TCP check, the failure-to-bind messages disappear. I have checked the destination IP/port by navigating to http://192.168.1.71:8080 in a browser and it works as expected; however, it does not work going through the VIP (.11). It looks like it should be an HTTP_GET check anyway.

I can fetch the page with curl http://192.168.1.71:8080 from the command line of vm1, so I know vm1 has access to .71's HTTP server.
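
curl exercises HTTP end to end, whereas the TCP_CHECK in the config only needs a TCP connection to 192.168.1.71:8080 within connect_timeout. A rough stand-in for that probe, as a sketch only: `tcp_check` is a hypothetical helper, not keepalived's code. Running it from a login shell also does not reproduce the SELinux context the keepalived service runs under, so it can succeed where the daemon's own check is denied.

```python
import socket


def tcp_check(host, port, timeout=5.0, bind_ip=None):
    """Return True if a TCP connection to host:port succeeds within timeout.

    bind_ip optionally binds the local end to a specific source address
    before connecting; port 0 lets the kernel pick the source port.
    """
    source = (bind_ip, 0) if bind_ip else None
    try:
        with socket.create_connection((host, port), timeout=timeout,
                                      source_address=source):
            return True
    except OSError:
        # Covers refused connections, timeouts, and bind failures alike.
        return False

# Example use from vm1:
#   tcp_check("192.168.1.71", 8080)
```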

Navigating in a browser to http://192.168.1.11:8080 still results in a timeout. The status output shows no sign of an issue; I'm going to look into a more verbose log option...

This is where I've picked up most of what I have...

According to this (bottom of page 6), chances are keepalived is removing the real server from the list. It seems like something may be preventing the keepalived service from reaching the real server with the TCP check or the HTTP GET. Maybe an SELinux policy?

/var/log/audit/audit.log was full of keepalived entries...

Found this and attempted setting the allow-connect-any boolean, which didn't change my results.

I also tried using audit2allow to generate rules and then apply them. Although the audit log seems to have stopped logging denied messages, the forwarding from .11 to .71 is still not working.

Still not seeing anything indicative of errors:

Jul 20 12:46:59 [hostname] Keepalived[1951]: Starting Keepalived v1.3.5 (03/19,2017), git commit v1.3.5-6-g6fa32f2
Jul 20 12:46:59 [hostname] Keepalived[1951]: Opening file '/etc/keepalived/keepalived.conf'.
Jul 20 12:46:59 [hostname] Keepalived[1952]: Starting Healthcheck child process, pid=1953
Jul 20 12:46:59 [hostname] Keepalived[1952]: Starting VRRP child process, pid=1954
Jul 20 12:46:59 [hostname] Keepalived_healthcheckers[1953]: Opening file '/etc/keepalived/keepalived.conf'.
Jul 20 12:46:59 [hostname] Keepalived_healthcheckers[1953]: Activating healthchecker for service [192.168.1.11]:8080
Jul 20 12:46:59 [hostname] systemd: Started LVS and VRRP High Availability Monitor.
Jul 20 12:46:59 [hostname] Keepalived_vrrp[1954]: Registering Kernel netlink reflector
Jul 20 12:46:59 [hostname] Keepalived_vrrp[1954]: Registering Kernel netlink command channel
Jul 20 12:46:59 [hostname] Keepalived_vrrp[1954]: Registering gratuitous ARP shared channel
Jul 20 12:46:59 [hostname] Keepalived_vrrp[1954]: Opening file '/etc/keepalived/keepalived.conf'.
Jul 20 12:46:59 [hostname] Keepalived_vrrp[1954]: Truncating auth_pass to 8 characters
Jul 20 12:46:59 [hostname] Keepalived_vrrp[1954]: VRRP_Instance(VIP01) removing protocol VIPs.
Jul 20 12:46:59 [hostname] Keepalived_vrrp[1954]: Using LinkWatch kernel netlink reflector...
Jul 20 12:46:59 [hostname] Keepalived_vrrp[1954]: VRRP sockpool: [ifindex(2), proto(112), unicast(0), fd(10,11)]
Jul 20 12:47:00 [hostname] Keepalived_vrrp[1954]: VRRP_Instance(VIP01) Transition to MASTER STATE
Jul 20 12:47:01 [hostname] Keepalived_vrrp[1954]: VRRP_Instance(VIP01) Entering MASTER STATE
Jul 20 12:47:01 [hostname] Keepalived_vrrp[1954]: VRRP_Instance(VIP01) setting protocol VIPs.
Jul 20 12:47:01 [hostname] Keepalived_vrrp[1954]: Sending gratuitous ARP on eth0 for 192.168.1.11
Jul 20 12:47:01 [hostname] Keepalived_vrrp[1954]: VRRP_Instance(VIP01) Sending/queueing gratuitous ARPs on eth0 for 192.168.1.11
Jul 20 12:47:01 [hostname] Keepalived_vrrp[1954]: Sending gratuitous ARP on eth0 for 192.168.1.11
Jul 20 12:47:01 [hostname] Keepalived_vrrp[1954]: Sending gratuitous ARP on eth0 for 192.168.1.11
Jul 20 12:47:01 [hostname] Keepalived_vrrp[1954]: Sending gratuitous ARP on eth0 for 192.168.1.11
Jul 20 12:47:01 [hostname] Keepalived_vrrp[1954]: Sending gratuitous ARP on eth0 for 192.168.1.11
Jul 20 12:47:06 [hostname] Keepalived_vrrp[1954]: Sending gratuitous ARP on eth0 for 192.168.1.11
Jul 20 12:47:06 [hostname] Keepalived_vrrp[1954]: VRRP_Instance(VIP01) Sending/queueing gratuitous ARPs on eth0 for 192.168.1.11
Jul 20 12:47:06 [hostname] Keepalived_vrrp[1954]: Sending gratuitous ARP on eth0 for 192.168.1.11
Jul 20 12:47:06 [hostname] Keepalived_vrrp[1954]: Sending gratuitous ARP on eth0 for 192.168.1.11
Jul 20 12:47:06 [hostname] Keepalived_vrrp[1954]: Sending gratuitous ARP on eth0 for 192.168.1.11
Jul 20 12:47:06 [hostname] Keepalived_vrrp[1954]: Sending gratuitous ARP on eth0 for 192.168.1.11

Also worth mentioning: I previously disabled the firewalls to rule them out...

Pinging 192.168.1.11 and pulling the network connection to vm1 results in failover as expected, so the issue is really with my virtual/real server setup somehow...

failovercluster

keepalived

centos7
