I have a 4-node Docker Swarm cluster in AWS in which 2 nodes keep getting disconnected. Sometimes rebooting the affected node brings it back, and sometimes I have to reboot all the nodes in the cluster to get everything working again.
[ec2-user@ip-172-31-7-235 ~]$ docker node ls
ID                            HOSTNAME                                       STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
xhei85m3mjp6wikz81phl01sx *   ip-172-31-7-235.us-west-2.compute.internal     Ready     Active         Leader           20.10.4
a63wole6vosq1t5s25wib8ggu     ip-172-31-36-138.us-west-2.compute.internal    Down      Active                          19.03.13-ce
guw26oul1i2fb60f5shud8xif     ip-172-31-47-112.us-west-2.compute.internal    Ready     Active         Reachable        19.03.13-ce
ex996ixxqo3s0mcig1zfzankg     ip-172-31-47-251.us-west-2.compute.internal    Ready     Active                          19.03.13-ce
And the output of the inspect command for the node that is down:
[ec2-user@ip-172-31-7-235 ~]$ docker node inspect ip-172-31-36-138.us-west-2.compute.internal
[
    {
        "ID": "a63wole6vosq1t5s25wib8ggu",
        "Version": {
            "Index": 212444
        },
        "CreatedAt": "2021-02-10T13:25:54.271879167Z",
        "UpdatedAt": "2021-07-23T07:36:17.078000983Z",
        "Spec": {
            "Labels": {},
            "Role": "worker",
            "Availability": "active"
        },
        "Description": {
            "Hostname": "ip-172-31-36-138.us-west-2.compute.internal",
            "Platform": {
                "Architecture": "x86_64",
                "OS": "linux"
            },
            "Resources": {
                "NanoCPUs": 2000000000,
                "MemoryBytes": 8362287104
            },
            "Engine": {
                "EngineVersion": "19.03.13-ce",
                "Plugins": [
                    {
                        "Type": "Log",
                        "Name": "awslogs"
                    },
                    {
                        "Type": "Log",
                        "Name": "fluentd"
                    },
                    {
                        "Type": "Log",
                        "Name": "gcplogs"
                    },
                    {
                        "Type": "Log",
                        "Name": "gelf"
                    },
                    {
                        "Type": "Log",
                        "Name": "journald"
                    },
                    {
                        "Type": "Log",
                        "Name": "json-file"
                    },
                    {
                        "Type": "Log",
                        "Name": "local"
                    },
                    {
                        "Type": "Log",
                        "Name": "logentries"
                    },
                    {
                        "Type": "Log",
                        "Name": "splunk"
                    },
                    {
                        "Type": "Log",
                        "Name": "syslog"
                    },
                    {
                        "Type": "Network",
                        "Name": "bridge"
                    },
                    {
                        "Type": "Network",
                        "Name": "host"
                    },
                    {
                        "Type": "Network",
                        "Name": "ipvlan"
                    },
                    {
                        "Type": "Network",
                        "Name": "macvlan"
                    },
                    {
                        "Type": "Network",
                        "Name": "null"
                    },
                    {
                        "Type": "Network",
                        "Name": "overlay"
                    },
                    {
                        "Type": "Volume",
                        "Name": "local"
                    }
                ]
            },
            "TLSInfo": {
                "TrustRoot": "-----BEGIN CERTIFICATE-----\nMIIBajCCARCgAwIBAgIUCi5JL30BEEaYOmlbrp9A+Rivul0wCgYIKoZIzj0EAwIw\nEzERMA8GA1UEAxMIc3dhcm0tY2EwHhcNMjEwMjEwMTMwMjAwWhcNNDEwMjA1MTMw\nMjAwWjATMREwDwYDVQQDEwhzd2FybS1jYTBZMBMGByqGSM49AgEGCCqGSM49AwEH\nA0IABFqgXKora10w8BODSxg9O4N9UveYhsitjwz+pHSi/6BB0j7YBu+4RADv4ZjK\nitIYTCLZZKbOx9saQ2YeB8sBxFajQjBAMA4GA1UdDwEB/wQEAwIBBjAPBgNVHRMB\nAf8EBTADAQH/MB0GA1UdDgQWBBTETORYsVN1OwUTjtYJHSJtGx55QzAKBggqhkjO\nPQQDAgNIADBFAiEA7qNRnsq0LUFenYODEah4Rku1YYpHBCHIid4W4Hy7MVcCICQF\n9BTfuQsAp5uQ72ycyWQfyQziFzbG+Sb/zQ8NzCRf\n-----END CERTIFICATE-----\n",
                "CertIssuerSubject": "MBMxETAPBgNVBAMTCHN3YXJtLWNh",
                "CertIssuerPublicKey": "MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEWqBcqitrXTDwE4NLGD07g31S95iGyK2PDP6kdKL/oEHSPtgG77hEAO/hmMqK0hhMItlkps7H2xpDZh4HywHEVg=="
            }
        },
        "Status": {
            "State": "down",
            "Message": "heartbeat failure for node in \"unknown\" state",
            "Addr": "172.31.36.138"
        }
    }
]
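The "heartbeat failure for node in \"unknown\" state" status makes me suspect the swarm control-plane traffic between the manager and this worker, so my first guess is to re-check from the manager that the standard swarm ports are reachable and allowed in the security groups. This is just a rough sketch of what I am running (172.31.36.138 is the worker that keeps dropping):
# Run from the manager against the worker that keeps disconnecting.
# Swarm needs TCP 2377 (cluster management), TCP+UDP 7946 (node gossip)
# and UDP 4789 (overlay/VXLAN) open in both directions.
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/172.31.36.138/2377' && echo "2377/tcp ok"
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/172.31.36.138/7946' && echo "7946/tcp ok"
# The UDP ports (7946, 4789) can't be probed reliably this way, so I review
# those directly in the security group / NACL rules instead.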
Please suggest how to track down and fix this issue.
The issue comes back even after replacing the affected node with a new one.
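In the meantime I plan to capture the daemon state and logs on the affected worker around the next disconnect, roughly like this (assuming Amazon Linux 2 with systemd; the two-hour window is just an example):
# On the disconnected worker: check the daemon and its view of swarm membership,
# then pull the dockerd logs covering the time of the disconnect.
sudo systemctl status docker
docker info --format '{{.Swarm.LocalNodeState}} {{.Swarm.NodeID}}'
sudo journalctl -u docker.service --since "2 hours ago" --no-pager | tail -n 200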