
HA Kubernetes cluster: accidental kubeadm reset on one master node, connection refused when rejoining the cluster


I have set up a Kubernetes cluster with 2 master nodes (cp01 192.168.1.42, cp02 192.168.1.46) and 4 worker nodes, using haproxy and keepalived running as static pods in the cluster and a stacked (internal) etcd cluster. For some silly reason, I accidentally ran kubeadm reset -f on cp01. Now I am trying to rejoin the cluster with kubeadm join, but I keep getting dial tcp 192.168.1.49:8443: connect: connection refused, where 192.168.1.49 is the load-balancer IP. Please help! Below are the current configurations.

/etc/haproxy/haproxy.cfg on cp02

defaults
    timeout connect 10s
    timeout client 30s
    timeout server 30s
frontend apiserver
    bind *:8443
    mode tcp
    option tcplog
    default_backend apiserver
backend apiserver
    option httpchk GET /healthz
    http-check expect status 200
    mode tcp
    option ssl-hello-chk
    balance roundrobin
        default-server inter 10s downinter 5s rise 2 fall 2 slowstart 60s maxconn 250 maxqueue 256 weight 100
        #server master01 192.168.1.42:6443 check     *** the one I accidentally reset
        server master02 192.168.1.46:6443 check
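
Once cp01 is able to rejoin, its server line will need to be uncommented and haproxy reloaded so the backend balances across both masters again. A fragment based on the config above (running haproxy -c -f /etc/haproxy/haproxy.cfg first validates the file before a reload):

    backend apiserver
        # ...same options as above...
        server master01 192.168.1.42:6443 check    # re-enable once cp01 has rejoined
        server master02 192.168.1.46:6443 check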

/etc/keepalived/keepalived.conf on cp02

global_defs {
    router_id LVS_DEVEL
    script_user root
    enable_script_security
    dynamic_interfaces
}
vrrp_script check_apiserver {
    script "/etc/keepalived/check_apiserver.sh"
    interval 3
    weight -2
    fall 10
    rise 2
}
vrrp_instance VI_l {
    state BACKUP
    interface ens192
    virtual_router_id 51
    priority 101
    authentication {
        auth_type PASS
        auth_pass ***
    }
    virtual_ipaddress {
        192.168.1.49/24
    }
    track_script {
        check_apiserver
    }
}
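
The vrrp_script above references /etc/keepalived/check_apiserver.sh, which is not shown in the question. For completeness, a typical version can be sketched (patterned on the common kubeadm HA example, using the VIP and port from the configs here; this is an assumption, not the asker's actual script). It is written to a temp path so the sketch is safe to run anywhere:

```shell
# Sketch of a keepalived health-check script for the API server VIP.
# Content is an assumption based on the usual kubeadm HA example.
cat > /tmp/check_apiserver.sh <<'EOF'
#!/bin/sh
errorExit() {
    echo "*** $*" 1>&2
    exit 1
}
# The API server must answer locally on the load-balanced port...
curl --silent --max-time 2 --insecure https://localhost:8443/ -o /dev/null \
    || errorExit "Error GET https://localhost:8443/"
# ...and via the VIP, if this node currently holds it.
if ip addr | grep -q 192.168.1.49; then
    curl --silent --max-time 2 --insecure https://192.168.1.49:8443/ -o /dev/null \
        || errorExit "Error GET https://192.168.1.49:8443/"
fi
EOF
chmod +x /tmp/check_apiserver.sh
echo "wrote /tmp/check_apiserver.sh"
```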

kubeadm-config ConfigMap

apiVersion: v1
data:
    ClusterConfiguration: |
        apiServer:
            extraArgs:
                authorization-mode: Node,RBAC
            timeoutForControlPlane: 4m0s
        apiVersion: kubeadm.k8s.io/v1beta2
        certificatesDir: /etc/kubernetes/pki
        clusterName: kubernetes
        controlPlaneEndpoint: 192.168.1.49:8443
        controllerManager: {}
        dns:
            type: CoreDNS
        etcd:
            local:
                dataDir: /var/lib/etcd
        imageRepository: k8s.gcr.io
        kind: ClusterConfiguration
        kubernetesVersion: v1.19.2
        networking:
            dnsDomain: cluster.local
            podSubnet: 10.244.0.0/16
            serviceSubnet: 10.96.0.0/12
        scheduler: {}
    ClusterStatus: |
        apiEndpoints:
            cp02:
                advertiseAddress: 192.168.1.46
                bindPort: 6443
        apiVersion: kubeadm.k8s.io/v1beta2
        kind: ClusterStatus
...

kubectl cluster-info

Kubernetes master is running at https://192.168.1.49:8443
KubeDNS is running at https://192.168.1.49:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

More Info

  1. cluster was initialised with --upload-certs on cp01.

  2. I drained and deleted cp01 from the cluster.

  3. kubeadm join --token ... --discovery-token-ca-cert-hash ... --control-plane --certificate-key ... command returned:

    error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Get "https://192.168.1.49:8443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=10s": dial tcp 192.168.1.49:8443: connect: connection refused
    
  4. kubectl exec -n kube-system -it etcd-cp02 -- etcdctl --endpoints=https://192.168.1.46:2379 --key=/etc/kubernetes/pki/etcd/peer.key --cert=/etc/kubernetes/pki/etcd/peer.crt --cacert=/etc/kubernetes/pki/etcd/ca.crt member list returned:

    ..., started, cp02, https://192.168.1.46:2380, https://192.168.1.46:2379, false
    
  5. kubectl describe pod/etcd-cp02 -n kube-system:

    ...
    Container ID: docker://...
    Image: k8s.gcr.io/etcd:3.4.13-0
    Image ID: docker://...
    Port: <none>
    Host Port: <none>
    Command:
      etcd
      --advertise-client-urls=https://192.168.1.46:2379
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
      --client-cert-auth=true
      --data-dir=/var/lib/etcd
      --initial-advertise-peer-urls=https://192.168.1.46:2380
      --initial-cluster=cp01=https://192.168.1.42:2380,cp02=https://192.168.1.46:2380
      --initial-cluster-state=existing
      --key-file=/etc/kubernetes/pki/etcd/server.key
      --listen-client-urls=https://127.0.0.1:2379,https://192.168.1.46:2379
      --listen-metrics-urls=http://127.0.0.1:2381
      --listen-peer-urls=https://192.168.1.46:2380
      --name=cp02
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
      --peer-client-cert-auth=true
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --snapshot-count=10000
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      ...
    
  6. Tried copying the certs below to cp01:/etc/kubernetes/pki before running kubeadm join 192.168.1.49:8443 --token ... --discovery-token-ca-cert-hash ..., but it returned the same error.

    # files copied over to cp01
    ca.crt
    ca.key
    sa.key
    sa.pub
    front-proxy-ca.crt
    front-proxy-ca.key
    etcd/ca.crt
    etcd/ca.key
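
One more thing worth checking after an accidental reset: kubeadm reset does not always manage to deregister the node's etcd member, and a stale member blocks a control-plane rejoin. Point 4 above shows only cp02 in the member list, so that is fine here, but if a stale cp01 entry had remained it would need removing first. A dry-run sketch (the member ID is a hypothetical placeholder; take the real one from member list, and run etcdctl inside the etcd pod as in point 4):

```shell
# Hypothetical member ID; substitute the ID printed by 'etcdctl member list'.
MEMBER_ID="abcdef1234567890"
# Build the removal command (flags mirror the etcdctl invocation in point 4).
CMD="etcdctl --endpoints=https://192.168.1.46:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key member remove $MEMBER_ID"
# Dry run: print the command instead of executing it against a live cluster.
echo "$CMD"
```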
    

Troubleshoot network

  1. Able to ping 192.168.1.49 on cp01

  2. nc -v 192.168.1.49 8443 on cp01 returned Ncat: Connection refused.

  3. curl -k https://192.168.1.49:8443/api/v1... works on cp02 and worker nodes (returns code 403 which should be normal).

  4. /etc/cni/net.d/ is removed on cp01

  5. Manually cleared iptables rules on cp01 containing 'KUBE' or 'cali'.

  6. firewalld is disabled on both cp01 and cp02.

  7. I tried joining with a new server cp03 192.168.1.48 and encountered the same dial tcp 192.168.1.49:8443: connect: connection refused error.

  8. netstat -tlnp | grep 8443 on cp02 returned:

    tcp    0    0.0.0.0:8443    0.0.0.0:*    LISTEN 27316/haproxy
    
  9. nc -v 192.168.1.46 6443 on cp01 and cp03 returns:

    Ncat: Connected to 192.168.1.46:6443
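
The individual checks above can be rolled into one small loop, run from the node that is trying to join (endpoints are the ones from the question; the use of nc and the 3-second timeout are assumptions about the environment):

```shell
# Probe the load-balancer VIP and the healthy apiserver directly.
n=0
for ep in 192.168.1.49:8443 192.168.1.46:6443; do
    host=${ep%:*}
    port=${ep#*:}
    if nc -z -w 3 "$host" "$port" 2>/dev/null; then
        echo "$ep: reachable"
    else
        echo "$ep: refused or unreachable"
    fi
    n=$((n+1))
done
```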
    

Any advice/guidance would be greatly appreciated as I am at a loss here. I'm thinking it might be due to the network rules on cp02 but I don't really know how to check this. Thank you!!

Answer

Figured out what the issue was when I ran ip a: ens192 on cp01 still had the secondary IP address 192.168.1.49 assigned, i.e. cp01 was still holding the keepalived VIP.

A simple ip addr del 192.168.1.49/24 dev ens192 followed by kubeadm join... and cp01 was able to rejoin the cluster successfully. Can't believe I missed that...
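
This also explains the symptoms: with the VIP still configured on cp01's NIC, cp01's own traffic to 192.168.1.49:8443 was delivered locally, where nothing was listening after the reset, and cp01 was presumably still answering ARP for the VIP, which would account for the brand-new cp03 seeing the same refusal. A guarded version of the fix (VIP and interface name from the question) that only deletes the address when it is actually present:

```shell
VIP="192.168.1.49/24"
IFACE="ens192"
# Delete the stale VIP only if it is actually assigned on this host.
if ip -4 addr show dev "$IFACE" 2>/dev/null | grep -q "inet ${VIP%/*}/"; then
    ip addr del "$VIP" dev "$IFACE" && echo "removed stale VIP from $IFACE"
else
    echo "VIP not present on $IFACE; nothing to delete"
fi
```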
