For weeks now I have been seeing a lot of timeouts when the GCP load balancers check the ingress-nginx /healthz endpoint, even though everything responds correctly.
I have a GKE cluster running Container-Optimized OS on n1-standard-4 machines, Kubernetes version v1.21.10-gke.2000.
Here are my nodes:
kubectl top no
NAME                                                  CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gke-xxx-gke-cluster0-xxx-gke-cluster0-0a2ef32c-6lj0   821m         20%    3683Mi          29%
gke-xxx-gke-cluster0-xxx-gke-cluster0-98567a10-pqk2   2302m        58%    4983Mi          40%
gke-xxx-gke-cluster0-xxx-gke-cluster0-cd892740-3v6m   83m          2%     852Mi           6%
Here are my ingress-nginx pods and services:
NAME                                 READY   STATUS    RESTARTS   AGE
pod/nginx-ingress-controller-fnxlc   1/1     Running   0          65m
pod/nginx-ingress-controller-m4nq2   1/1     Running   0          67m
pod/nginx-ingress-controller-tb4gc   1/1     Running   0          66m
NAME                                       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
service/nginx-ingress-controller           NodePort    REDACTED     <none>        80:32080/TCP,443:32443/TCP   69d
service/nginx-ingress-controller-metrics   ClusterIP   REDACTED     <none>        10254/TCP                    69d
Here are my Helm values for ingress-nginx/ingress-nginx version 4.1.0:
ingressClassResource:
  name: nginx
  enabled: true
  default: false
  controllerValue: "k8s.io/ingress-nginx"
kind: DaemonSet
livenessProbe:
  httpGet:
    path: "/healthz"
    port: 10254
    scheme: HTTP
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 1
  successThreshold: 1
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: "/healthz"
    port: 10254
    scheme: HTTP
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 1
  successThreshold: 1
  failureThreshold: 3
podAnnotations:
  prometheus.io/scrape_metrics_app: "true"
  prometheus.io/scrape_metrics_port_app: "10254"
  prometheus.io/scrape_metrics_port_name_app: metrics
resources:
  requests:
    cpu: 100m
    memory: 120Mi
service:
  enabled: true
  annotations:
    cloud.google.com/backend-config: '{"ports": {"80":"security-policy"}}'
  targetPorts:
    http: http
    https: https
  type: NodePort
  nodePorts:
    http: 32080
    https: 32443
    tcp:
      8080: 32808
metrics:
  port: 10254
  # if this port is changed, change healthz-port: in extraArgs: accordingly
  enabled: true
priorityClassName: nginx-ingress
admissionWebhooks:
  enabled: false
  patch:
    priorityClassName: nginx-ingress
My PriorityClass, BackendConfig and Ingress:
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: nginx-ingress
value: 1000000
globalDefault: false
---
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: security-policy
spec:
  timeoutSec: 60
  connectionDraining:
    drainingTimeoutSec: 10
  securityPolicy:
    name: "REDACTED"
  healthCheck:
    checkIntervalSec: 10
    timeoutSec: 5
    healthyThreshold: 1
    unhealthyThreshold: 2
    port: 32080
    type: HTTP
    requestPath: /healthz
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx-ingress-controller-gke
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "REDACTED"
    kubernetes.io/ingress.class: "gce"
spec:
  ingressClassName: nginx
  defaultBackend:
    service:
      name: nginx-ingress-controller
      port:
        number: 80
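For reference, here is a rough sketch of how long a failure could go on before each checker reacts, using the numbers from the configs above. It assumes the simplified model that a target is marked failed after `threshold` consecutive failed probes from a single prober, with probes started one interval apart and each failing only after the full timeout. The function name is made up for this illustration, and real load balancers use several probers at once, so actual detection can be faster.

```python
def worst_case_seconds(interval_sec, timeout_sec, threshold):
    """Approximate seconds from the start of the first failing probe until
    the threshold-th consecutive probe has timed out (single-prober model,
    an assumption made for this sketch)."""
    return (threshold - 1) * interval_sec + timeout_sec


# BackendConfig healthCheck: checkIntervalSec=10, timeoutSec=5, unhealthyThreshold=2
lb_unhealthy_after = worst_case_seconds(10, 5, 2)

# readinessProbe: periodSeconds=10, timeoutSeconds=1, failureThreshold=3
kubelet_unready_after = worst_case_seconds(10, 1, 3)

print(f"LB marks backend unhealthy after roughly {lb_unhealthy_after}s")
print(f"kubelet marks pod unready after roughly {kubelet_unready_after}s")
```

Under this model the load balancer health check reacts sooner than the kubelet readiness probe, so a short connectivity blip could be reported by the load balancer without the pod ever being marked unready or restarted.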
My firewall rule:
allowed:
- IPProtocol: tcp
  ports:
  - '32080'
  - '80'
creationTimestamp: 'REDACTED'
description: ''
direction: INGRESS
disabled: false
id: 'REDACTED'
kind: compute#firewall
logConfig:
  enable: false
name: REDACTED-allow-i-google-gke-health
network: https://www.googleapis.com/compute/v1/projects/REDACTED/global/networks/REDACTED
priority: 1000
selfLink: https://www.googleapis.com/compute/v1/projects/REDACTED/global/firewalls/REDACTED-allow-i-google-gke-health
sourceRanges:
- 130.211.0.0/22
- 35.191.0.0/16
targetServiceAccounts:
- REDACTED
My backend service is HEALTHY:
---
backend: https://www.googleapis.com/compute/v1/projects/REDACTED/zones/europe-west1-b/instanceGroups/k8s-ig--REDACTED
status:
  healthStatus:
  - healthState: HEALTHY
    instance: https://www.googleapis.com/compute/v1/projects/REDACTED/zones/europe-west1-b/instances/REDACTED
    ipAddress: REDACTED
    port: 32080
  kind: compute#backendServiceGroupHealth
---
backend: https://www.googleapis.com/compute/v1/projects/REDACTED/zones/europe-west1-c/instanceGroups/k8s-ig--REDACTED
status:
  healthStatus:
  - healthState: HEALTHY
    instance: https://www.googleapis.com/compute/v1/projects/REDACTED/zones/europe-west1-c/instances/REDACTED
    ipAddress: REDACTED
    port: 32080
  kind: compute#backendServiceGroupHealth
---
backend: https://www.googleapis.com/compute/v1/projects/REDACTED/zones/europe-west1-d/instanceGroups/k8s-ig--REDACTED
status:
  healthStatus:
  - healthState: HEALTHY
    instance: https://www.googleapis.com/compute/v1/projects/REDACTED/zones/europe-west1-d/instances/REDACTED
    ipAddress: REDACTED
    port: 32080
  kind: compute#backendServiceGroupHealth
My target http/https proxies are OKAY.
The problem is that since upgrading to GKE 1.21 I have been getting a lot of health check timeouts from the Google load balancer:
{
  "insertId": "120vrdac2cf",
  "jsonPayload": {
    "healthCheckProbeResult": {
      "healthCheckProtocol": "HTTP",
      "healthState": "UNHEALTHY",
      "previousHealthState": "HEALTHY",
      "probeResultText": "HTTP response: , Error: Timeout waiting for connect",
      "probeSourceIp": "35.191.13.216",
      "ipAddress": "REDACTED",
      "probeCompletionTimestamp": "2022-04-27T15:40:52.868912018Z",
      "previousDetailedHealthState": "HEALTHY",
      "targetIp": "REDACTED",
      "detailedHealthState": "TIMEOUT",
      "responseLatency": "5.001074s",
      "targetPort": 32080,
      "probeRequest": "/healthz"
    }
  },
  "resource": {
    "type": "gce_instance_group",
    "labels": {
      "instance_group_name": "k8s-ig--d350a72156e88e7d",
      "instance_group_id": "7274987390644036118",
      "location": "europe-west1-c",
      "project_id": "REDACTED"
    }
  },
  "timestamp": "2022-04-27T15:40:53.307035382Z",
  "severity": "INFO",
  "logName": "projects/REDACTED/logs/compute.googleapis.com%2Fhealthchecks",
  "receiveTimestamp": "2022-04-27T15:40:54.568716762Z"
}
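To see whether the timeouts cluster on one instance group or zone, entries like the one above could be tallied with a small script. This is only a sketch: it assumes the log entries have been exported to a JSON array (for example with `gcloud logging read` and `--format=json`), the file name in the usage note is hypothetical, and the function names are mine. It only relies on fields visible in the entry above.

```python
import json
from collections import Counter


def summarize_probe_results(entries):
    """Count health check log entries per
    (instance group, location, detailedHealthState)."""
    counts = Counter()
    for entry in entries:
        result = entry.get("jsonPayload", {}).get("healthCheckProbeResult")
        if not result:
            continue  # skip anything that is not a health check probe result
        labels = entry.get("resource", {}).get("labels", {})
        key = (
            labels.get("instance_group_name", "unknown"),
            labels.get("location", "unknown"),
            result.get("detailedHealthState", "unknown"),
        )
        counts[key] += 1
    return counts


def load_entries(path):
    """Load a JSON array of log entries from an exported file."""
    with open(path) as f:
        return json.load(f)
```

Usage would look like `summarize_probe_results(load_entries("healthchecks.json")).most_common()`, where `healthchecks.json` stands for whatever file the export was written to.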
Here is a screenshot of all the errors (image: health check errors).
I have no firewall issues.
From a node, no health check issues:
while true; do curl -m 2 -o /dev/null -sw "%{http_code} %{time_total}s\n" 0:32080/healthz; done
200 0.000984s
200 0.000845s
200 0.000704s
200 0.002411s
200 0.001235s
200 0.000784s
200 0.001471s
200 0.000498s
The http response is always 200.
All of this suggests that both the GKE nodes and the pods are healthy. If the pods were unhealthy I would see restarts, and I have none at all.
The pods' health checks always respond within milliseconds.
But for some unknown reason I'm getting a lot of "Timeout waiting for connect" health check failures, which are causing traffic issues on my website.
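Since the probe error is a connect timeout rather than a slow HTTP response, and my curl test above only hits the NodePort locally on the node, it may be worth timing the TCP connect itself from another machine in the same VPC. A minimal sketch of that idea follows. The function names are mine, the host list would be the nodes' internal IPs (not shown here), and a VM in the VPC does not take exactly the same path as the Google probers, so this is an approximation at best.

```python
import socket
import time


def time_tcp_connect(host, port, timeout=5.0):
    """Try one TCP connect; return (connected, seconds_elapsed)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            connected = True
    except OSError:  # covers timeouts, refused connections, unreachable hosts
        connected = False
    return connected, time.monotonic() - start


def probe_nodes(hosts, port=32080, attempts=20, timeout=5.0):
    """Probe each host several times; return {host: [(connected, seconds), ...]}."""
    return {
        host: [time_tcp_connect(host, port, timeout) for _ in range(attempts)]
        for host in hosts
    }
```

The default port 32080 and the 5 second timeout mirror the NodePort and the health check `timeoutSec` from my configuration. Any `(False, ~5.0)` result would reproduce what the load balancer reports.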
During my debugging there was no traffic on my website.
I don't remember having any issues with GKE 1.19/1.20. I have of course tried several 1.21 versions, but still no luck.
I upgraded ingress-nginx from 4.0.16 to 4.1.0, but the issue is still present.
I also increased the health check interval and timeout, but the problem remains.
I thought nginx might be reloading its configuration too often, but that is not the case: the log lines all look pretty much like this one:
nginx-ingress-controller-fnxlc controller I0427 16:10:38.352350 8 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"nginx-ingress", Name:"nginx-ingress-controller-gke", UID:"45baf918-c5b9-499e-9930-b6e5d03aa38e", APIVersion:"networking.k8s.io/v1", ResourceVersion:"83550719", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
Does anyone have the same problem?
Any help?