Google LB failed to check ingress-nginx pods healthz sporadically

Lord-Y

8/27/23, 4:15 PM

For weeks now I have been seeing a lot of timeouts when the GCP load balancers check the ingress-nginx /healthz endpoint, even though everything responds correctly.

I'm running a GKE cluster with Container-Optimized OS on n1-standard-4 machines, Kubernetes version v1.21.10-gke.2000.

Here are my nodes:

kubectl top no
NAME                                                  CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
gke-xxx-gke-cluster0-xxx-gke-cluster0-0a2ef32c-6lj0   821m         20%    3683Mi          29%       
gke-xxx-gke-cluster0-xxx-gke-cluster0-98567a10-pqk2   2302m        58%    4983Mi          40%       
gke-xxx-gke-cluster0-xxx-gke-cluster0-cd892740-3v6m   83m          2%     852Mi           6%

Here are my ingress-nginx pods and services:

NAME                                 READY   STATUS    RESTARTS   AGE
pod/nginx-ingress-controller-fnxlc   1/1     Running   0          65m
pod/nginx-ingress-controller-m4nq2   1/1     Running   0          67m
pod/nginx-ingress-controller-tb4gc   1/1     Running   0          66m

NAME                                       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
service/nginx-ingress-controller           NodePort    REDACTED   <none>        80:32080/TCP,443:32443/TCP   69d
service/nginx-ingress-controller-metrics   ClusterIP   REDACTED    <none>        10254/TCP                    69d

Here are my Helm values for ingress-nginx/ingress-nginx version 4.1.0:

  ingressClassResource:
    name: nginx
    enabled: true
    default: false
    controllerValue: "k8s.io/ingress-nginx"

  kind: DaemonSet

  livenessProbe:
    httpGet:
      path: "/healthz"
      port: 10254
      scheme: HTTP
    initialDelaySeconds: 10
    periodSeconds: 10
    timeoutSeconds: 1
    successThreshold: 1
    failureThreshold: 5
  readinessProbe:
    httpGet:
      path: "/healthz"
      port: 10254
      scheme: HTTP
    initialDelaySeconds: 10
    periodSeconds: 10
    timeoutSeconds: 1
    successThreshold: 1
    failureThreshold: 3
  podAnnotations:
    prometheus.io/scrape_metrics_app: "true"
    prometheus.io/scrape_metrics_port_app: "10254"
    prometheus.io/scrape_metrics_port_name_app: metrics
  resources:
    requests:
      cpu: 100m
      memory: 120Mi

  service:
    enabled: true

    annotations:
      cloud.google.com/backend-config: '{"ports": {"80":"security-policy"}}'

    targetPorts:
      http: http
      https: https

    type: NodePort
    nodePorts:
      http: 32080
      https: 32443
      tcp:
        8080: 32808

  metrics:
    port: 10254
    # if this port is changed, change healthz-port: in extraArgs: accordingly
    enabled: true
  priorityClassName: nginx-ingress

  admissionWebhooks:
    enabled: false
    patch:
      priorityClassName: nginx-ingress

My backend config:

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: nginx-ingress
value: 1000000
globalDefault: false
---
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: security-policy
spec:
  timeoutSec: 60
  connectionDraining:
    drainingTimeoutSec: 10
  securityPolicy:
    name: "REDACTED"
  healthCheck:
    checkIntervalSec: 10
    timeoutSec: 5
    healthyThreshold: 1
    unhealthyThreshold: 2
    port: 32080
    type: HTTP
    requestPath: /healthz
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx-ingress-controller-gke
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "REDACTED"
    kubernetes.io/ingress.class: "gce"
spec:
  ingressClassName: nginx
  defaultBackend:
    service:
      name: nginx-ingress-controller
      port:
        number: 80

My firewall rule:

allowed:
- IPProtocol: tcp
  ports:
  - '32080'
  - '80'
creationTimestamp: 'REDACTED'
description: ''
direction: INGRESS
disabled: false
id: 'REDACTED'
kind: compute#firewall
logConfig:
  enable: false
name: REDACTED-allow-i-google-gke-health
network: https://www.googleapis.com/compute/v1/projects/REDACTED/global/networks/REDACTED
priority: 1000
selfLink: https://www.googleapis.com/compute/v1/projects/REDACTED/global/firewalls/REDACTED-allow-i-google-gke-health
sourceRanges:
- 130.211.0.0/22
- 35.191.0.0/16
targetServiceAccounts:
- REDACTED

My backend service is HEALTHY:

---
backend: https://www.googleapis.com/compute/v1/projects/REDACTED/zones/europe-west1-b/instanceGroups/k8s-ig--REDACTED
status:
  healthStatus:
  - healthState: HEALTHY
    instance: https://www.googleapis.com/compute/v1/projects/REDACTED/zones/europe-west1-b/instances/REDACTED
    ipAddress: REDACTED
    port: 32080
  kind: compute#backendServiceGroupHealth
---
backend: https://www.googleapis.com/compute/v1/projects/REDACTED/zones/europe-west1-c/instanceGroups/k8s-ig--REDACTED
status:
  healthStatus:
  - healthState: HEALTHY
    instance: https://www.googleapis.com/compute/v1/projects/REDACTED/zones/europe-west1-c/instances/REDACTED
    ipAddress: REDACTED
    port: 32080
  kind: compute#backendServiceGroupHealth
---
backend: https://www.googleapis.com/compute/v1/projects/REDACTED/zones/europe-west1-d/instanceGroups/k8s-ig--REDACTED
status:
  healthStatus:
  - healthState: HEALTHY
    instance: https://www.googleapis.com/compute/v1/projects/REDACTED/zones/europe-west1-d/instances/REDACTED
    ipAddress: REDACTED
    port: 32080
  kind: compute#backendServiceGroupHealth

My target HTTP/HTTPS proxies are OK.

The problem is that since GKE 1.21 I have been getting a lot of health check timeouts from the Google LB:

{
  "insertId": "120vrdac2cf",
  "jsonPayload": {
    "healthCheckProbeResult": {
      "healthCheckProtocol": "HTTP",
      "healthState": "UNHEALTHY",
      "previousHealthState": "HEALTHY",
      "probeResultText": "HTTP response: , Error: Timeout waiting for connect",
      "probeSourceIp": "35.191.13.216",
      "ipAddress": "REDACTED",
      "probeCompletionTimestamp": "2022-04-27T15:40:52.868912018Z",
      "previousDetailedHealthState": "HEALTHY",
      "targetIp": "REDACTED",
      "detailedHealthState": "TIMEOUT",
      "responseLatency": "5.001074s",
      "targetPort": 32080,
      "probeRequest": "/healthz"
    }
  },
  "resource": {
    "type": "gce_instance_group",
    "labels": {
      "instance_group_name": "k8s-ig--d350a72156e88e7d",
      "instance_group_id": "7274987390644036118",
      "location": "europe-west1-c",
      "project_id": "REDACTED"
    }
  },
  "timestamp": "2022-04-27T15:40:53.307035382Z",
  "severity": "INFO",
  "logName": "projects/REDACTED/logs/compute.googleapis.com%2Fhealthchecks",
  "receiveTimestamp": "2022-04-27T15:40:54.568716762Z"
}
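To check whether the timeouts cluster on one instance group or zone, entries like the one above can be tallied. This is only a minimal sketch in Python, assuming the entries have been exported as a JSON array; the function name `summarize_health_logs` and the trimmed `SAMPLE` input are made up for illustration, while the field names are the ones that appear in the log entry above.

```python
import json
from collections import Counter


def summarize_health_logs(entries):
    """Count probe results per (instance group, zone, detailed health state)."""
    counts = Counter()
    for entry in entries:
        probe = entry.get("jsonPayload", {}).get("healthCheckProbeResult")
        if probe is None:
            continue  # not a health check probe entry
        labels = entry.get("resource", {}).get("labels", {})
        key = (
            labels.get("instance_group_name", "unknown"),
            labels.get("location", "unknown"),
            probe.get("detailedHealthState", "unknown"),
        )
        counts[key] += 1
    return counts


# Trimmed copy of the log entry above, used only as sample input.
SAMPLE = json.loads("""
[{"jsonPayload": {"healthCheckProbeResult": {
    "healthState": "UNHEALTHY",
    "detailedHealthState": "TIMEOUT",
    "probeSourceIp": "35.191.13.216",
    "responseLatency": "5.001074s",
    "targetPort": 32080}},
  "resource": {"type": "gce_instance_group", "labels": {
    "instance_group_name": "k8s-ig--d350a72156e88e7d",
    "location": "europe-west1-c"}}}]
""")

for (group, zone, state), n in summarize_health_logs(SAMPLE).most_common():
    print(f"{group} {zone} {state}: {n}")
```

If most TIMEOUT entries fall on a single group or zone, that would point at one node rather than at the health check configuration as a whole.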

Here is a screenshot of all errors: [image: health check errors]

I have no firewall issues. From a node, the health check never fails:

while true; do curl -m 2 -o /dev/null -sw "%{http_code} %{time_total}s\n" 0:32080/healthz; done

200 0.000984s
200 0.000845s
200 0.000704s
200 0.002411s
200 0.001235s
200 0.000784s
200 0.001471s
200 0.000498s
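The curl loop above runs on the node itself, so it may not exercise the network path the load balancer uses, and the LB error is reported in the TCP connect phase ("Timeout waiting for connect"). A rough sketch of a probe that reports which phase failed, using the same 5 second timeout as the BackendConfig, could look like this. The function names `probe` and `watch` are hypothetical, and running it from another VM assumes a firewall rule that lets that VM reach port 32080 on the node.

```python
import socket
import time


def probe(host, port, path="/healthz", timeout=5.0):
    """Run one HTTP probe and return (phase, detail, seconds).

    phase is "connect" if the TCP connection could not be opened,
    "response" if it opened but no status line came back, else "ok".
    """
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:  # covers timeouts and refused connections
        return "connect", type(exc).__name__, time.monotonic() - start
    try:
        with sock:
            sock.settimeout(timeout)
            request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
            sock.sendall(request.encode("ascii"))
            with sock.makefile("rb") as reader:
                status_line = reader.readline().decode("ascii", "replace").strip()
    except OSError as exc:
        return "response", type(exc).__name__, time.monotonic() - start
    if not status_line:
        return "response", "empty reply", time.monotonic() - start
    return "ok", status_line, time.monotonic() - start


def watch(host, port, count=10, interval=1.0):
    """Probe repeatedly and print one line per attempt."""
    for _ in range(count):
        phase, detail, elapsed = probe(host, port)
        print(f"{phase} {detail} {elapsed:.6f}s")
        time.sleep(interval)
```

Calling `watch("<node internal IP>", 32080)` from a second machine would then show whether failures appear as `connect` (as in the LB logs) or as `response`.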

The HTTP response is always 200. All of this suggests that both GKE and the pods are healthy. If the pods were unhealthy, I would see restarts, and I have none at all. The pods' health checks always respond within milliseconds.

But for some unknown reason, I'm getting a lot of "Timeout waiting for connect" health check failures, which are causing traffic issues on my website.

During my debugging, there was no traffic on my website.

I don't remember having any issues with GKE 1.19/1.20. I have of course tried several 1.21 versions, but still no luck.

I switched from ingress-nginx 4.0.16 to 4.1.0 but the issue is still present.

I also increased the health check interval and timeout, but the problem remains.

I thought nginx might be reloading its config very often, but that is not the case: the log lines all look pretty much like this one:

nginx-ingress-controller-fnxlc controller I0427 16:10:38.352350       8 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"nginx-ingress", Name:"nginx-ingress-controller-gke", UID:"45baf918-c5b9-499e-9930-b6e5d03aa38e", APIVersion:"networking.k8s.io/v1", ResourceVersion:"83550719", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync

Does anyone have the same problem? Any help?

google-compute-engine

kubernetes

google-cloud-platform

google-cloud-network-load-balancer

nginx-ingress
