Score:1

Load balancer marks new instance group member (Ubuntu) as "unhealthy" after dist-upgrade

tl flag

I have some VMs (working as web servers) in an instance group on Google Cloud.

As routine maintenance I updated (apt dist-upgrade) my "vm-source-image", created a new template and added it to my group.

New members created from this template never receive any real requests from the load balancer: they are up and running but unemployed.

Temporary patch

For now I apply only a partial update (security updates only) with:

sudo unattended-upgrade -d

Here is the list of the remaining packages, at least one of which causes the problem:

# apt list --upgradable

cloud-init/bionic-updates 21.3-1-g6803368d-0ubuntu1~18.04.4 all [upgradable from: 21.2-3-g899bfaa9-0ubuntu2~18.04.1]
dnsmasq-base/bionic-updates 2.79-1ubuntu0.5 amd64 [upgradable from: 2.79-1ubuntu0.4]
gce-compute-image-packages/bionic-updates 20210629.00-0ubuntu1~18.04.0 all [upgradable from: 20201222.00-0ubuntu2~18.04.0]
google-compute-engine/bionic-updates 20210629.00-0ubuntu1~18.04.0 all [upgradable from: 20201222.00-0ubuntu2~18.04.0]
google-compute-engine-oslogin/bionic-updates 20210728.00-0ubuntu1~18.04.0 amd64 [upgradable from: 20210429.00-0ubuntu1~18.04.0]
google-guest-agent/bionic-updates 20210629.00-0ubuntu1~18.04.1 amd64 [upgradable from: 20210414.00-0ubuntu1~18.04.0]
libgnutls30/bionic-updates 3.5.18-1ubuntu1.5 amd64 [upgradable from: 3.5.18-1ubuntu1.4]
libnetplan0/bionic-updates 0.99-0ubuntu3~18.04.5 amd64 [upgradable from: 0.99-0ubuntu3~18.04.4]
libpcre2-8-0/bionic 10.39-1+ubuntu18.04.1+deb.sury.org+1 amd64 [upgradable from: 10.36-2+ubuntu18.04.1+deb.sury.org+2]
netplan.io/bionic-updates 0.99-0ubuntu3~18.04.5 amd64 [upgradable from: 0.99-0ubuntu3~18.04.4]
nplan/bionic-updates 0.99-0ubuntu3~18.04.5 all [upgradable from: 0.99-0ubuntu3~18.04.4]
snapd/bionic-updates 2.51.1+18.04 amd64 [upgradable from: 2.49.2+18.04]
ubuntu-advantage-tools/bionic-updates 27.3~18.04.1 amd64 [upgradable from: 27.2.2~18.04.1]
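
To keep these packages back during later upgrades, one option is to extract their names from the listing and put them on hold. This is a minimal sketch that assumes the output format shown above; `upgradable_names` is a hypothetical helper name, not part of apt.

```shell
# Print only the package names from `apt list --upgradable` output read on stdin.
# Lines without the "[upgradable from: ...]" suffix (such as "Listing...") are skipped.
upgradable_names() {
  awk -F/ '/\[upgradable from:/ { print $1 }'
}

# Possible usage (needs root; holds every package that is currently upgradable):
#   apt list --upgradable 2>/dev/null | upgradable_names | xargs sudo apt-mark hold
#   apt-mark showhold
```

Held packages can be released again with `apt-mark unhold` once a fixed version is available.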

A REAL SOLUTION

Since I have no "custom" packages on the machine and the problem originates from a system update, I see no solution other than pointing out the problem in this post.

Of course I'm monitoring new updates, hoping that a newer version of these packages solves the problem, but are there really no better options?

More info

  • The group is the backend of an "internal TCP load-balancer".
  • The frontend IP address of the load-balancer is 10.0.0.116
  • The old (and working) member IP address is 10.0.0.48 (visible in the logs)
  • The new (and unemployed) member IP address is 10.0.0.54 (visible in the logs)
  • The load-balancer has a simple HTTP health-check known as HTTPHC1.
  • The instance-group has another simple HTTP health-check known as HTTPHC2.

Comparing the access log of an old (and working) member with the new one:

Log of an old VM member

35.191.1.148 "/" - - - [04/Nov/2021:10:34:59 +0000] 10.0.0.48 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.144 "/" - - - [04/Nov/2021:10:35:00 +0000] 10.0.0.48 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.154 "/" - - - [04/Nov/2021:10:35:00 +0000] 10.0.0.48 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.147 "/" - - - [04/Nov/2021:10:35:01 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.145 "/" - - - [04/Nov/2021:10:35:01 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.151 "/" - - - [04/Nov/2021:10:35:02 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.153 "/" - - - [04/Nov/2021:10:35:02 +0000] 10.0.0.48 "GET /?id=HTTPHC1 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"

Log of a new VM member

35.191.1.152 "/" - - - [04/Nov/2021:10:31:01 +0000] 10.0.0.54 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.154 "/" - - - [04/Nov/2021:10:31:02 +0000] 10.0.0.54 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
35.191.1.148 "/" - - - [04/Nov/2021:10:31:02 +0000] 10.0.0.54 "GET /?id=HTTPHC2 HTTP/1.1" 200 612 "-" "GoogleHC/1.0"

The difference is that the HTTPHC1 entries are missing from the new member's log.

So the new member doesn't answer the load-balancer's health check (HTTPHC1) and therefore receives no requests. That is the problem.
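
The same comparison can be made without reading the log by eye. This is a minimal sketch that assumes the access-log format shown above, where each health check is tagged with an `id` query parameter; `count_health_checks` is a hypothetical helper name and the log path depends on the web server.

```shell
# Count health-check requests per "id" query parameter in an access log read on stdin.
# Prints one "<id> <count>" line per health check, sorted by id.
count_health_checks() {
  grep -o 'GET /?id=[A-Za-z0-9_]*' | sed 's/^.*id=//' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Possible usage (the path is a placeholder):
#   count_health_checks < /path/to/access.log
```

On the working member both HTTPHC1 and HTTPHC2 should be listed; on the new member only HTTPHC2 shows up.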

Other malfunctions

The new machine is also inaccessible via browser-window SSH.

ADD tcpdump

Between HTTPHC1 health-checker and unemployed member:

# tcpdump -n host 35.191.1.151
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), capture size 262144 bytes
11:30:35.109469 IP 35.191.1.151.61838 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS  ecr 0,nop,wscale 8], length 0
11:30:36.119470 IP 35.191.1.151.61838 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS  ecr 0,nop,wscale 8], length 0
11:30:38.167436 IP 35.191.1.151.61838 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS  ecr 0,nop,wscale 8], length 0
11:30:40.110784 IP 35.191.1.151.59900 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS  ecr 0,nop,wscale 8], length 0
11:30:41.111176 IP 35.191.1.151.59900 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:43.159164 IP 35.191.1.151.59900 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
11:30:45.112162 IP 35.191.1.151.36064 > 10.0.0.116.80: Flags [S], win 65535, options [mss 1420,sackOK,TS  ecr 0,nop,wscale 8], length 0

Note that the destination is the load-balancer frontend IP (10.0.0.116), and that the capture contains only SYN packets with no reply from the VM.

Between HTTPHC2 health-checker and unemployed member:

# tcpdump -n host 35.191.1.148
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), capture size 262144 bytes
10:46:12.475724 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [S], win 65535, options [mss 1420,sackOK,TS ecr 0,nop,wscale 8], length 0
10:46:12.475788 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [S.], win 64768, options [mss 1420,sackOK,TS,nop,wscale 7], length 0
10:46:12.476239 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [.], ack 1, win 256, options [nop,nop,TS], length 0
10:46:12.476239 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [P.], seq 1:117, ack 1, win 256, options [nop,nop,TS], length 116: HTTP: GET /?id=HTTPHC2 HTTP/1.1
10:46:12.476301 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [.], ack 117, win 506, options [nop,nop,TS], length 0
10:46:12.476546 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [P.], seq 1:867, ack 117, win 506, options [nop,nop,TS], length 866: HTTP: HTTP/1.1 200 OK
10:46:12.476659 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [.], ack 867, win 267, options [nop,nop,TS], length 0
10:46:12.476679 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [F.], seq 117, ack 867, win 267, options [nop,nop,TS], length 0
10:46:12.476707 IP 10.0.0.54.80 > 35.191.1.148.64638: Flags [F.], seq 867, ack 118, win 506, options [nop,nop,TS], length 0
10:46:12.476879 IP 35.191.1.148.64638 > 10.0.0.54.80: Flags [.], ack 868, win 267, options [nop,nop,TS], length 0

Here everything is fine.

ADD 2021-11-16

After some research I found that an IP alias is missing from the local routing table. Unsurprisingly, it is the load-balancer frontend IP address, the one visible as the destination host in the tcpdump!

Here is the working machine:

# ip route show dev ens4 table local
local 10.0.0.48 proto kernel scope host src 10.0.0.48 
local 10.0.0.116 proto 66 scope host 
# uname -r
5.4.0-1056-gcp

And here is the fully updated machine:

# ip route show dev ens4 table local
local 10.0.0.54 proto kernel scope host src 10.0.0.54
# uname -r
5.4.0-1057-gcp
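
A quick way to test for this on any member is to look for the frontend IP among the local routes. This is a minimal sketch that assumes the interface name ens4 and the frontend address 10.0.0.116 from above; `has_local_route` is a hypothetical helper name.

```shell
# Succeed (exit 0) if the given IP appears as a "local" entry in the
# routing-table listing read on stdin, fail (exit 1) otherwise.
has_local_route() {
  awk -v ip="$1" '$1 == "local" && $2 == ip { found = 1 } END { exit !found }'
}

# Possible usage:
#   if ip route show dev ens4 table local | has_local_route 10.0.0.116; then
#     echo "frontend IP present"
#   else
#     echo "frontend IP missing"
#   fi
```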

ADD 2021-11-20

It has now become a known issue: [Cloud Networking] Potential Service Issue: Investigating

Google Cloud Global TCP Proxy Load Balancers may be unable to serve traffic over forwarding rules configured with IPs in the 34.111.0.0/17 range. A permanent fix for the IP range is In Progress

Wojtek_B avatar
jp flag
Are the new VMs accessible from other VMs in the same VPC? How did you log in to your new VM?
tl flag
@Wojtek_B the VM is perfectly accessible through its IP (10.0.0.54). It's the LB (IMO the frontend component) that doesn't know the real IP of the machine.
Wojtek_B avatar
jp flag
I suspect the culprit here is [Netplan](https://netplan.io/). I'm not familiar with it, but it's a networking utility, and after the upgrade you lost the VM's external IP and one of the health checks is failing. Check your `/etc/netplan/*.yaml` files before and after an upgrade: have they changed?
Wojtek_B avatar
jp flag
You can always try to create another health check that will work and change it in the load-balancer's settings.
tl flag
@Wojtek_B if the goal is to find the guilty package, yes, checking `/etc/netplan/*.yaml` could be a solution, but my goal is to solve the problem with the cleanest approach possible, e.g. create a new machine with ubuntu-20 (or better, ubuntu-22), or uninstall whatever unneeded package XYZK is the real origin of the problem.
tl flag
@Wojtek_B I don't think it's possible to work around the balancer's lack of knowledge of the "group-member-IP" with any real health-check. :(
Wojtek_B avatar
jp flag
Can you try doing the upgrade, but keeping the old versions of the `libnetplan0`, `netplan.io` and `nplan` packages?
tl flag
Hi @Wojtek_B, I've upgraded the system except for the `*netplan*` packages. Unfortunately I still have the problem, so they are not the "troublemakers".
Wojtek_B avatar
jp flag
Maybe just try installing them one by one and check whether that "breaks" the configuration. It seems like a pretty quick solution since there are just a few of them.
tl flag
Not so quick: it implies the full deployment process (power-on -> update -> power-off -> image -> disk -> template -> deploy), roughly 15-20 minutes per package. OK, it's not building Rome, but it's not so quick either.
Wojtek_B avatar
jp flag
You can always try installing the first half, and if everything works after the restart you already know that you have to look for the culprit in the other half. Split it in two again and repeat the process.
tl flag
@Wojtek_B the always good b-tree approach :D I'll give it a try tomorrow
tl flag
@Wojtek_B what do you think of my last *add*?
Wojtek_B avatar
jp flag
Good work nailing it down. Did you add the route? `ip route add to local IP_HERE dev ens4 proto 66`
tl flag
@Wojtek_B I just read EthanWang's answer and now I'd rather know the answer to his question: "why is google-guest-agent not running automatically?" ;)
Wojtek_B avatar
jp flag
This looks like an issue with the logging agent and it would be best to report it on [Google's IssueTracker](https://issuetracker.google.com). I tried reproducing it with a simple Ubuntu 16.04 instance and ran sudo apt upgrade with no issues.
Score:3
gb flag

After testing, cloud-init is the root cause.

According to this comment, disable_network_activation: true should be set to avoid conflict with the google-guest-agent service.

The solution is to add this setting to the cloud-init config (run as root):

cat > /etc/cloud/cloud.cfg.d/99-disable-network-activation.cfg <<EOF
# Disable network activation to prevent \`cloud-init\` from making network
# changes that conflict with \`google-guest-agent\`.
# See: https://github.com/canonical/cloud-init/pull/1048

disable_network_activation: true
EOF

This file exists in the official image ubuntu-1804-bionic-v20211103.

After adding this file, the google-guest-agent is running normally.

tl flag
I think you did very well: you found the solution and created the bash patch (works like a charm). Good job!
Score:0
cn flag

I have a machine running Ubuntu 18.04.5 and hit the same problem after running apt dist-upgrade, which also upgraded google-guest-agent to 20210629.00-0ubuntu1~18.04.1 (from 20210414.00-0ubuntu1~18.04.0).

I found that google-guest-agent is not running after the upgrade. When I execute /usr/bin/google_guest_agent manually, the problem is solved.

I still don't know why google-guest-agent does not start automatically.
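
To check whether the agent is running after an upgrade, systemd can be queried directly. This is a minimal sketch that assumes the service unit is named google-guest-agent (it matches the package name, but it is an assumption here); `unit_state` is a hypothetical helper name.

```shell
# Print "running" or "not running" for a systemd unit.
# The default unit name google-guest-agent is an assumption based on the package name.
unit_state() {
  unit="${1:-google-guest-agent}"
  if systemctl is-active --quiet "$unit"; then
    echo "running"
  else
    echo "not running"
  fi
}

# Possible usage:
#   unit_state
#   unit_state google-guest-agent
```

If it reports "not running", `journalctl -u google-guest-agent` may show why the service did not start.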

tl flag
Thanks @Ethan, I'll pass your info on to Google support and keep you updated.
tl flag
I wonder why this problem is not more widespread. Could it be because it happens only on "customized" systems? For example, I disabled "apt-daily.service". Is it the same for you?