We have a NodePool dedicated to CI agents. When everything works properly, our CI controller creates a pod for a CI agent, and GCP's autoscaler scales the NodePool up automatically. In that case, the pod first gets the following event, saying that no existing node matches its affinity:
0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector.
and the new nodes come online shortly after. However, most of the time, the autoscaler fails with:
pod didn't trigger scale-up: 3 Insufficient ephemeral-storage, 6 node(s) didn't match Pod's node affinity/selector
When this happens, I have to scale the NodePool manually through the NodePool section of the GCP UI, which works immediately.
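The same manual workaround can be scripted from the CLI (sketch; the cluster name and region here are placeholders, not our actual values):

```shell
# Manually resize the "gha" NodePool when the autoscaler refuses to
# scale up. Replace CLUSTER_NAME and the region with your own values.
gcloud container clusters resize CLUSTER_NAME \
  --node-pool gha \
  --num-nodes 3 \
  --region europe-west1
```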
I'm fairly confident there is a bug somewhere between Kubernetes and GCP's infrastructure, probably in the autoscaler. What do you think?
Here is the configuration of the NodePool, in case it helps:
autoscaling:
  enabled: true
  maxNodeCount: 3
config:
  diskSizeGb: 100
  diskType: pd-standard
  ephemeralStorageConfig:
    localSsdCount: 2
  imageType: COS_CONTAINERD
  labels:
    _redacted_: 'true'
  machineType: c2-standard-16
  metadata:
    disable-legacy-endpoints: 'true'
  oauthScopes:
  - https://www.googleapis.com/auth/cloud-platform
  preemptible: true
  serviceAccount: _redacted_
  shieldedInstanceConfig:
    enableIntegrityMonitoring: true
  tags:
  - gke-main
  taints:
  - effect: NO_SCHEDULE
    key: _redacted_
    value: 'true'
  workloadMetadataConfig:
    mode: GKE_METADATA
initialNodeCount: 1
instanceGroupUrls:
- _redacted_
locations:
- europe-west1-c
- europe-west1-b
- europe-west1-d
management:
  autoRepair: true
  autoUpgrade: true
maxPodsConstraint:
  maxPodsPerNode: '110'
name: gha
networkConfig:
  podIpv4CidrBlock: 10.0.0.0/17
  podRange: main-europe-west1-pods
podIpv4CidrSize: 24
selfLink: _redacted_
status: RUNNING
upgradeSettings:
  maxSurge: 1
version: 1.21.11-gke.900
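In case it helps with the "Insufficient ephemeral-storage" part, this is how the allocatable ephemeral storage of the existing nodes can be inspected (a plain kubectl query, nothing specific to our setup):

```shell
# Show each node's allocatable ephemeral-storage, which is what the
# autoscaler's scheduling simulation compares against the pod's requests.
kubectl get nodes \
  -o custom-columns=NAME:.metadata.name,EPHEMERAL:.status.allocatable.ephemeral-storage
```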
Thanks!