GKE Node auto-provisioning not scaling up with limits defined

I want to use GKE node auto-provisioning (NAP) to create a GPU node pool on demand, that is, when I start a Job that needs GPU resources.

Following the GCP tutorial, I set up a cluster with cluster autoscaling and node auto-provisioning enabled. NAP has limits configured for CPU, memory, and GPU:

resourceLimits:
  - maximum: '15'
    minimum: '1'
    resourceType: cpu
  - maximum: '150'
    minimum: '1'
    resourceType: memory
  - maximum: '2'
    resourceType: nvidia-tesla-k80

I know that NAP works because it has already spun up a few nodes for me, but all of them were "normal" ones (without a GPU).

Now I want to "force" NAP to create a node pool with a GPU machine; so far, no GPU node has existed in the cluster. To do that, I'm creating a Job with the following configuration file:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
spec:
  ttlSecondsAfterFinished: 100
  template:
    metadata:
      name: training-job
    spec:
      nodeSelector:
        gpu: "true"
        cloud.google.com/gke-spot: "true"
        cloud.google.com/gke-accelerator: nvidia-tesla-k80
      tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: gpu-test
          image: przomys/gpu-test
          resources:
            requests:
              cpu: 500m
            limits:
              nvidia.com/gpu: 2 # requesting 2 GPU
      restartPolicy: Never # Do not restart containers after they exit

The Job is created, but then it is marked as "Unschedulable", and the cluster autoscaler (CA) log gives me this error:

{
  "noDecisionStatus": {
    "measureTime": "1650370630",
    "noScaleUp": {
      "unhandledPodGroups": [
        {
          "rejectedMigs": [
            {
              "reason": {
                "messageId": "no.scale.up.mig.failing.predicate",
                "parameters": [
                  "NodeAffinity",
                  "node(s) didn't match Pod's node affinity/selector"
                ]
              },
              "mig": {
                "zone": "us-central1-c",
                "nodepool": "pool-3",
                "name": "gke-cluster-activeid-pool-3-af526144-grp"
              }
            },
            {
              "mig": {
                "name": "gke-cluster-activeid-nap-e2-standard--c7a4d4f1-grp",
                "zone": "us-central1-c",
                "nodepool": "nap-e2-standard-2-w52e84k8"
              },
              "reason": {
                "parameters": [
                  "NodeAffinity",
                  "node(s) didn't match Pod's node affinity/selector"
                ],
                "messageId": "no.scale.up.mig.failing.predicate"
              }
            }
          ],
          "napFailureReasons": [
            {
              "parameters": [
                "Any GPU."
              ],
              "messageId": "no.scale.up.nap.pod.gpu.no.limit.defined"
            }
          ],
          "podGroup": {
            "totalPodCount": 1,
            "samplePod": {
              "controller": {
                "apiVersion": "batch/v1",
                "kind": "Job",
                "name": "training-job"
              },
              "namespace": "default",
              "name": "training-job-7k8zd"
            }
          }
        }
      ],
      "unhandledPodGroupsTotalCount": 1
    }
  }
}
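
To make the event above easier to read, here is a minimal Python sketch that pulls out the NAP failure reasons and the per-node-pool rejection reasons. The function name summarize_no_scale_up and the SAMPLE constant are hypothetical names of mine, and SAMPLE is a trimmed copy of the log above, not a complete event; the field names are taken from that log only.

```python
import json

# Trimmed copy of the noScaleUp event shown above (assumption: only the
# fields needed for the summary are kept).
SAMPLE = json.loads("""
{
  "noDecisionStatus": {
    "noScaleUp": {
      "unhandledPodGroups": [
        {
          "rejectedMigs": [
            {
              "reason": {
                "messageId": "no.scale.up.mig.failing.predicate",
                "parameters": ["NodeAffinity",
                               "node(s) didn't match Pod's node affinity/selector"]
              },
              "mig": {"nodepool": "pool-3"}
            },
            {
              "reason": {
                "messageId": "no.scale.up.mig.failing.predicate",
                "parameters": ["NodeAffinity",
                               "node(s) didn't match Pod's node affinity/selector"]
              },
              "mig": {"nodepool": "nap-e2-standard-2-w52e84k8"}
            }
          ],
          "napFailureReasons": [
            {
              "messageId": "no.scale.up.nap.pod.gpu.no.limit.defined",
              "parameters": ["Any GPU."]
            }
          ]
        }
      ]
    }
  }
}
""")


def summarize_no_scale_up(event: dict) -> dict:
    """Collect NAP failure reasons and per-node-pool rejection reasons."""
    no_scale_up = event.get("noDecisionStatus", {}).get("noScaleUp", {})
    summary = {"nap": [], "migs": []}
    for group in no_scale_up.get("unhandledPodGroups", []):
        # Reasons why NAP could not create a new node pool for the pod group.
        for reason in group.get("napFailureReasons", []):
            summary["nap"].append(
                (reason.get("messageId"), reason.get("parameters", []))
            )
        # Reasons why each existing node pool (MIG) was rejected.
        for rejected in group.get("rejectedMigs", []):
            summary["migs"].append(
                (
                    rejected.get("mig", {}).get("nodepool"),
                    rejected.get("reason", {}).get("messageId"),
                )
            )
    return summary


if __name__ == "__main__":
    print(json.dumps(summarize_no_scale_up(SAMPLE), indent=2))
```

The summary separates the two kinds of messages: the existing pools are rejected because of the node selector, while NAP itself reports the GPU limit problem.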

My guess is that no.scale.up.nap.pod.gpu.no.limit.defined is the most important part. The GCP tutorial points me here. But I do have this limit defined, so I'm out of ideas...

Does anyone have an idea what I'm doing wrong?
