I have a 4-node Docker Swarm cluster in AWS in which 2 nodes keep getting disconnected. Sometimes rebooting the affected node brings it back, and sometimes I have to reboot all the nodes in the cluster to get everything working again.
[ec2-user@ip-172-31-7-235 ~]$ docker node ls
ID                            HOSTNAME                                       STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
xhei85m3mjp6wikz81phl01sx *   ip-172-31-7-235.us-west-2.compute.internal     Ready     Active         Leader           20.10.4
a63wole6vosq1t5s25wib8ggu     ip-172-31-36-138.us-west-2.compute.internal    Down      Active                          19.03.13-ce
guw26oul1i2fb60f5shud8xif     ip-172-31-47-112.us-west-2.compute.internal    Ready     Active         Reachable        19.03.13-ce
ex996ixxqo3s0mcig1zfzankg     ip-172-31-47-251.us-west-2.compute.internal    Ready     Active                          19.03.13-ce
And the output of the inspect command for the node that is down:
[ec2-user@ip-172-31-7-235 ~]$ docker node inspect ip-172-31-36-138.us-west-2.compute.internal
[
    {
        "ID": "a63wole6vosq1t5s25wib8ggu",
        "Version": {
            "Index": 212444
        },
        "CreatedAt": "2021-02-10T13:25:54.271879167Z",
        "UpdatedAt": "2021-07-23T07:36:17.078000983Z",
        "Spec": {
            "Labels": {},
            "Role": "worker",
            "Availability": "active"
        },
        "Description": {
            "Hostname": "ip-172-31-36-138.us-west-2.compute.internal",
            "Platform": {
                "Architecture": "x86_64",
                "OS": "linux"
            },
            "Resources": {
                "NanoCPUs": 2000000000,
                "MemoryBytes": 8362287104
            },
            "Engine": {
                "EngineVersion": "19.03.13-ce",
                "Plugins": [
                    {
                        "Type": "Log",
                        "Name": "awslogs"
                    },
                    {
                        "Type": "Log",
                        "Name": "fluentd"
                    },
                    {
                        "Type": "Log",
                        "Name": "gcplogs"
                    },
                    {
                        "Type": "Log",
                        "Name": "gelf"
                    },
                    {
                        "Type": "Log",
                        "Name": "journald"
                    },
                    {
                        "Type": "Log",
                        "Name": "json-file"
                    },
                    {
                        "Type": "Log",
                        "Name": "local"
                    },
                    {
                        "Type": "Log",
                        "Name": "logentries"
                    },
                    {
                        "Type": "Log",
                        "Name": "splunk"
                    },
                    {
                        "Type": "Log",
                        "Name": "syslog"
                    },
                    {
                        "Type": "Network",
                        "Name": "bridge"
                    },
                    {
                        "Type": "Network",
                        "Name": "host"
                    },
                    {
                        "Type": "Network",
                        "Name": "ipvlan"
                    },
                    {
                        "Type": "Network",
                        "Name": "macvlan"
                    },
                    {
                        "Type": "Network",
                        "Name": "null"
                    },
                    {
                        "Type": "Network",
                        "Name": "overlay"
                    },
                    {
                        "Type": "Volume",
                        "Name": "local"
                    }
                ]
            },
            "TLSInfo": {
                "TrustRoot": "-----BEGIN CERTIFICATE-----\nMIIBajCCARCgAwIBAgIUCi5JL30BEEaYOmlbrp9A+Rivul0wCgYIKoZIzj0EAwIw\nEzERMA8GA1UEAxMIc3dhcm0tY2EwHhcNMjEwMjEwMTMwMjAwWhcNNDEwMjA1MTMw\nMjAwWjATMREwDwYDVQQDEwhzd2FybS1jYTBZMBMGByqGSM49AgEGCCqGSM49AwEH\nA0IABFqgXKora10w8BODSxg9O4N9UveYhsitjwz+pHSi/6BB0j7YBu+4RADv4ZjK\nitIYTCLZZKbOx9saQ2YeB8sBxFajQjBAMA4GA1UdDwEB/wQEAwIBBjAPBgNVHRMB\nAf8EBTADAQH/MB0GA1UdDgQWBBTETORYsVN1OwUTjtYJHSJtGx55QzAKBggqhkjO\nPQQDAgNIADBFAiEA7qNRnsq0LUFenYODEah4Rku1YYpHBCHIid4W4Hy7MVcCICQF\n9BTfuQsAp5uQ72ycyWQfyQziFzbG+Sb/zQ8NzCRf\n-----END CERTIFICATE-----\n",
                "CertIssuerSubject": "MBMxETAPBgNVBAMTCHN3YXJtLWNh",
                "CertIssuerPublicKey": "MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEWqBcqitrXTDwE4NLGD07g31S95iGyK2PDP6kdKL/oEHSPtgG77hEAO/hmMqK0hhMItlkps7H2xpDZh4HywHEVg=="
            }
        },
        "Status": {
            "State": "down",
            "Message": "heartbeat failure for node in \"unknown\" state",
            "Addr": "172.31.36.138"
        }
    }
]
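The "heartbeat failure for node in \"unknown\" state" status makes me suspect the swarm control-plane traffic between the manager and this worker, so my first guess is to re-check from the manager that the standard swarm ports are reachable and allowed in the security groups. This is just a rough sketch of what I am running (172.31.36.138 is the worker that keeps dropping):
# Run from the manager against the worker that keeps disconnecting.
# Swarm needs TCP 2377 (cluster management), TCP+UDP 7946 (node gossip)
# and UDP 4789 (overlay/VXLAN) open in both directions.
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/172.31.36.138/2377' && echo "2377/tcp ok"
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/172.31.36.138/7946' && echo "7946/tcp ok"
# The UDP ports (7946, 4789) can't be probed reliably this way, so I review
# those directly in the security group / NACL rules instead.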
Please suggest how to track down and fix this issue.
The issue comes back even after replacing the affected node with a new one.
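In the meantime I plan to capture the daemon state and logs on the affected worker around the next disconnect, roughly like this (assuming Amazon Linux 2 with systemd; the two-hour window is just an example):
# On the disconnected worker: check the daemon and its view of swarm membership,
# then pull the dockerd logs covering the time of the disconnect.
sudo systemctl status docker
docker info --format '{{.Swarm.LocalNodeState}} {{.Swarm.NodeID}}'
sudo journalctl -u docker.service --since "2 hours ago" --no-pager | tail -n 200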