I have some VMs (working as web servers) in a managed instance group on Google Cloud.
As routine maintenance I updated (apt dist-upgrade) my "vm-source-image", created a new instance template and added it to the group.
The new members built from this template never receive any real request from the load-balancer: they are up and running, but unemployed.
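For reference, the rollout is done roughly like this (a sketch; web-template-v2, web-group and the zone are placeholders, not the real resource names):

# build a new template from the updated source instance (names are placeholders)
gcloud compute instance-templates create web-template-v2 \
    --source-instance=vm-source-image --source-instance-zone=europe-west1-b
# roll the managed instance group onto the new template
gcloud compute instance-groups managed rolling-action start-update web-group \
    --version=template=web-template-v2 --zone=europe-west1-b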
Temporary patch
I apply only a partial update (the security fixes) with:
sudo unattended-upgrade -d
Here is the list of the remaining packages whose upgrade triggers the problem (a sketch for holding them back follows the list):
# apt list --upgradable
cloud-init/bionic-updates 21.3-1-g6803368d-0ubuntu1~18.04.4 all [upgradable from: 21.2-3-g899bfaa9-0ubuntu2~18.04.1]
dnsmasq-base/bionic-updates 2.79-1ubuntu0.5 amd64 [upgradable from: 2.79-1ubuntu0.4]
gce-compute-image-packages/bionic-updates 20210629.00-0ubuntu1~18.04.0 all [upgradable from: 20201222.00-0ubuntu2~18.04.0]
google-compute-engine/bionic-updates 20210629.00-0ubuntu1~18.04.0 all [upgradable from: 20201222.00-0ubuntu2~18.04.0]
google-compute-engine-oslogin/bionic-updates 20210728.00-0ubuntu1~18.04.0 amd64 [upgradable from: 20210429.00-0ubuntu1~18.04.0]
google-guest-agent/bionic-updates 20210629.00-0ubuntu1~18.04.1 amd64 [upgradable from: 20210414.00-0ubuntu1~18.04.0]
libgnutls30/bionic-updates 3.5.18-1ubuntu1.5 amd64 [upgradable from: 3.5.18-1ubuntu1.4]
libnetplan0/bionic-updates 0.99-0ubuntu3~18.04.5 amd64 [upgradable from: 0.99-0ubuntu3~18.04.4]
libpcre2-8-0/bionic 10.39-1+ubuntu18.04.1+deb.sury.org+1 amd64 [upgradable from: 10.36-2+ubuntu18.04.1+deb.sury.org+2]
netplan.io/bionic-updates 0.99-0ubuntu3~18.04.5 amd64 [upgradable from: 0.99-0ubuntu3~18.04.4]
nplan/bionic-updates 0.99-0ubuntu3~18.04.5 all [upgradable from: 0.99-0ubuntu3~18.04.4]
snapd/bionic-updates 2.51.1+18.04 amd64 [upgradable from: 2.49.2+18.04]
ubuntu-advantage-tools/bionic-updates 27.3~18.04.1 amd64 [upgradable from: 27.2.2~18.04.1]
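As a further stop-gap, and assuming the culprit is one of the GCE guest packages in the list above, those packages can be put on hold so a later dist-upgrade skips them:

# hold the suspect guest packages (names taken from the list above)
sudo apt-mark hold google-guest-agent google-compute-engine gce-compute-image-packages
# list current holds to verify
apt-mark showhold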
A REAL SOLUTION
Since I have no "custom" packages on the machine and the problem originates from a system update, I see no solution other than pointing out the problem with this post.
Of course I'm monitoring new updates, hoping that a newer version of these packages solves the problem, but is it possible there are no better options?
More info
- The group is the backend of an "internal TCP load-balancer".
- The frontend IP address of the load-balancer is 10.0.0.116
- The old (and working) member IP address is 10.0.0.48 (visible in the logs).
- The new (and unemployed) member IP address is 10.0.0.54 (visible in the logs).
- The load-balancer has a simple HTTP health-check known as HTTPHC1.
- The instance-group has another simple HTTP health-check known as HTTPHC2 (their status as seen from gcloud can be queried as sketched below).
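The health state that the load-balancer and the instance group actually see can be queried from gcloud (backend-service, group and region names below are placeholders):

# health as seen by the backend service of the internal TCP load-balancer (HTTPHC1)
gcloud compute backend-services get-health web-backend-service --region=europe-west1
# per-instance status inside the managed instance group (autohealing via HTTPHC2)
gcloud compute instance-groups managed list-instances web-group --zone=europe-west1-b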
Comparing the access log of an old (and working) member with the new one:
Log of an old VM member
35.191.1.148 "/" - - - [04/Nov/2021:10:34:59 +0000] 10.0.0.48 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.144 "/" - - - [04/Nov/2021:10:35:00 +0000] 10.0.0.48 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.154 "/" - - - [04/Nov/2021:10:35:00 +0000] 10.0.0.48 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.147 "/" - - - [04/Nov/2021:10:35:01 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.145 "/" - - - [04/Nov/2021:10:35:01 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.151 "/" - - - [04/Nov/2021:10:35:02 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.153 "/" - - - [04/Nov/2021:10:35:02 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
Log of a new VM member
35.191.1.152 "/" - - - [04/Nov/2021:10:31:01 +0000] 10.0.0.54 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.154 "/" - - - [04/Nov/2021:10:31:02 +0000] 10.0.0.54 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.148 "/" - - - [04/Nov/2021:10:31:02 +0000] 10.0.0.54 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
The difference is that the HTTPHC1 entries are missing.
So the new members don't answer the load-balancer's health check (HTTPHC1) and therefore don't receive requests: that's the problem.
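A quick test from inside the new member, assuming the web server answers the same URL the checks use, is to hit both the instance IP and the load-balancer frontend IP; the frontend IP is the interesting one, since that is the destination HTTPHC1 actually probes:

# the instance IP should answer (HTTPHC2 already proves it does)
curl -sS -o /dev/null -w '%{http_code}\n' 'http://10.0.0.54/?id=TEST'
# the frontend IP is the one to watch; a timeout here matches the missing HTTPHC1 logs
curl -sS -o /dev/null -w '%{http_code}\n' --max-time 5 'http://10.0.0.116/?id=TEST'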
Other malfunctions
The new machine is also unreachable via SSH-in-browser.
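Both SSH-in-browser key provisioning and the load-balancer plumbing are handled by the Google guest environment, so checking the guest agent on the new machine (through the serial console, for instance) is a reasonable next step; a minimal sketch:

# state and recent logs of the agent that provisions SSH keys and forwarding-rule routes
sudo systemctl status google-guest-agent
sudo journalctl -u google-guest-agent -b --no-pager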
ADD tcpdump
Between the HTTPHC1 health-checker and the unemployed member:
# tcpdump -n host 35.191.1.151
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), capture size 262144 bytes
11:30:35.109469 IP 35.191.1.151.61838 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:36.119470 IP 35.191.1.151.61838 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:38.167436 IP 35.191.1.151.61838 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:40.110784 IP 35.191.1.151.59900 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:41.111176 IP 35.191.1.151.59900 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:43.159164 IP 35.191.1.151.59900 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:45.112162 IP 35.191.1.151.36064 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
Note that the destination is the load-balancer frontend IP, 10.0.0.116, and of course they are only SYN packets; no reply is ever sent.
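Two neutral checks at this point (a sketch, not part of the original capture): confirm the web server is listening on all addresses, and that no local firewall rule drops traffic addressed to the frontend IP.

# the listener should be bound to 0.0.0.0:80, not only to 10.0.0.54
sudo ss -ltnp | grep ':80'
# no DROP/REJECT rule should match traffic destined to 10.0.0.116
sudo iptables -L INPUT -n -v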
Between the HTTPHC2 health-checker and the unemployed member:
# tcpdump -n host 35.191.1.148
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), capture size 262144 bytes
10:46:12.475724 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
10:46:12.475788 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [S.], win 64768, options [mss 1420,sackOK,TS,nop,wscale 7], length 0
10:46:12.476239 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [.], ack 1, win 256, options [nop,nop,TS], length 0
10:46:12.476239 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [P.], seq 1:117, ack 1, win 256, options [nop,nop,TS], length 116: HTTP: GET /?id=HTTPHC2 HTTP/1.1
10:46:12.476301 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [.], ack 117, win 506, options [nop,nop,TS], length 0
10:46:12.476546 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [P.], seq 1:867, ack 117, win 506, options [nop,nop,TS], length 866: HTTP: HTTP/1.1 200 OK
10:46:12.476659 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [.], ack 867, win 267, options [nop,nop,TS], length 0
10:46:12.476679 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [F.], seq 117, ack 867, win 267, options [nop,nop,TS], length 0
10:46:12.476707 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [F.], seq 867, ack 118, win 506, options [nop,nop,TS], length 0
10:46:12.476879 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [.], ack 868, win 267, options [nop,nop,TS], length 0
Here everything is fine.
ADD 2021-11-16
After some research I found a missing IP route in the local table; no surprise, it is the load-balancer frontend IP address, visible as the DST host in the tcpdump above!
Here is the working machine:
# ip route show dev ens4 table local
local 10.0.0.48 proto kernel scope host src 10.0.0.48
local 10.0.0.116 proto 66 scope host
# uname -r
5.4.0-1056-gcp
And here is the fully updated machine:
# ip route show dev ens4 table local
local 10.0.0.54 proto kernel scope host src 10.0.0.54
# uname -r
5.4.0-1057-gcp
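The guest agent normally programs the forwarding-rule IP into the local table (the proto 66 route above). As a manual workaround, and purely as an assumption until the packages are fixed, the route can be re-added by hand, or the agent restarted so it tries to re-program it (with the broken package version the restart may well not help):

# re-add the load-balancer frontend IP to the local table by hand
sudo ip route add to local 10.0.0.116/32 dev ens4 proto 66
# or let the guest agent try to re-program it
sudo systemctl restart google-guest-agent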
ADD 2021-11-20
Now it has become a known issue: [Cloud Networking] Potential Service Issue: Investigating
Google Cloud Global TCP Proxy Load Balancers may be unable to serve
traffic over forwarding rules configured with IPs in the 34.111.0.0/17
range. A permanent fix for the IP range is In Progress