We have a fairly simple setup: a VM on GCP without a public IP address. To reach the internet, it uses Cloud NAT with the basic configuration (see the attached image):
[Image: Cloud NAT configuration]
The problem is that the VM loses its internet connection:
- we cannot access it over SSH
- according to the syslog, the VM cannot reach the GCE metadata server:
  OSConfigAgent[514]: 2023-03-10T15:49:41.8034Z OSConfigAgent Error main.go:231: network error when requesting metadata, make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=2a783d496d54f634&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable
The only fix we have found is to restart the VM, after which the network works again. Once the problem starts, the metadata server error is repeated continuously. It is preceded by the following log entries:
systemd-networkd[501671]: ens4: Could not set DHCPv4 address: Connection timed out
systemd-networkd[501671]: ens4: Failed
kernel: [1118386.615077] systemd invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
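To see how often these events occur and in what order, the syslog can be scanned with something like the sketch below. The patterns are taken only from the log lines quoted in this question; the names `PATTERNS`, `classify` and `summarize` are hypothetical, and other failure messages may exist that these patterns would miss.

```python
import re

# Patterns derived from the log lines quoted in this question (assumption:
# the messages keep this wording across agent and systemd versions).
PATTERNS = {
    "dhcp_failure": re.compile(
        r"systemd-networkd\[\d+\]: \S+: Could not set DHCPv4 address"
    ),
    "oom_killer": re.compile(r"invoked oom-killer"),
    "metadata_unreachable": re.compile(r"network error when requesting metadata"),
}


def classify(line):
    """Return the name of the first matching pattern, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how many lines fall into each known category."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        name = classify(line)
        if name is not None:
            counts[name] += 1
    return counts
```

For example, `summarize(open("/var/log/syslog"))` would give per-category counts; the path is an assumption and depends on the distribution.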
Initially we suspected that the problem might be related to Cloud NAT, but we have no evidence to confirm this: the NAT logs (errors and translations) show no significant errors.
The main goal of this question is to find a way to avoid such a situation, or to handle it automatically without manual intervention. Please let me know if additional information is required.
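To make "handle it automatically" concrete, here is a minimal sketch of the kind of watchdog we have in mind. It probes the metadata server and restarts `systemd-networkd` after several consecutive failures. The function names, the interval and the threshold are hypothetical, and it rests on an untested assumption: that restarting `systemd-networkd` is enough to recover. If the real cause is memory exhaustion (the oom-killer line above), this would only treat the symptom, and the watchdog itself could be killed.

```python
import subprocess
import time
import urllib.request

# Standard GCE metadata endpoint; requires the Metadata-Flavor header.
METADATA_URL = "http://169.254.169.254/computeMetadata/v1/instance/id"


def metadata_reachable(url=METADATA_URL, timeout=3.0, opener=urllib.request.urlopen):
    """Return True if the metadata server answers with HTTP 200."""
    request = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    try:
        with opener(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, timeouts and "network is unreachable" are all OSError subclasses.
        return False


def restart_network():
    """Assumed recovery action: restart systemd-networkd (needs root)."""
    subprocess.run(["systemctl", "restart", "systemd-networkd"], check=False)


def watch(probe=metadata_reachable, recover=restart_network,
          interval=30.0, threshold=3, max_checks=None, sleep=time.sleep):
    """Probe periodically; call recover() after `threshold` consecutive failures.

    Returns the number of recovery attempts (only when max_checks is set).
    """
    failures = 0
    recoveries = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            failures = 0
        else:
            failures += 1
            if failures >= threshold:
                recover()
                recoveries += 1
                failures = 0
        sleep(interval)
    return recoveries
```

Calling `watch()` with the defaults would loop forever, so it would have to run as root under something like a systemd service. Is this a reasonable approach, or is there a way to address the root cause instead?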